The problem of structure estimation in latent graphical models is considered, where some nodes
are latent or hidden. We characterize conditions for tractable graph estimation and develop
efficient methods with provable guarantees. We consider models where the underlying Markov graph
is locally tree-like and the model is in the regime of correlation decay. For the special case of
the Ising model, the number of samples $n$ required for structural consistency of our method scales
as $n = \Omega(\theta_{\min}^{-\delta\eta(\eta+1)-2} \log p)$, where $\theta_{\min}$ is the minimum edge
potential, $\delta$ is the depth (i.e., distance from a hidden node to the nearest observed nodes),
and $\eta$ is a parameter which depends on the bounds on node and edge potentials in the Ising model.
Necessary conditions for structural consistency under any algorithm are derived and our method
nearly matches the lower bound on sample requirements. Further, the proposed method is practical to
implement and provides flexibility to control the number of latent variables and the cycle lengths
in the output graph.
1. Introduction. It is widely recognized that the process of fitting the observed samples to
a statistical model needs to incorporate latent or hidden factors, which are not directly observed.
Learning latent variable models from observed samples involves mainly two tasks: discovering
relationships between the observed and hidden variables, and estimating the strength of such
relationships.

One of the simplest latent variable models is the so-called latent class model or naive Bayes model,
where the observed variables are conditionally independent given the state of the latent factor. An
extension of these models is the class of latent tree models with many hidden variables forming a
tree hierarchy. Latent tree models have been effective in modeling data in a variety of domains, such
as the evolutionary process which gave rise to the present-day species in bioinformatics (popularly
known as phylogenetic tree models) [23, 48], financial and topic modeling [19], and modeling
contextual information for object recognition in computer vision [18]. There has been extensive
work on learning latent tree models (e.g. [19, 25, 39]), where it is demonstrated that latent tree
models can be learnt efficiently in high dimensions. In other words, the number of samples required
for consistent learning is much smaller than the number of variables at hand. Moreover, inference
on latent tree models is computationally tractable and can be carried out using simple algorithms
such as belief propagation.

Despite all the above advantages, latent tree models may not be suitable in all scenarios and 
the assumption of a tree structure may be too restrictive. For instance, when latent trees are used 
to model topic-word relationships, the hypothesis is that the topics (which are hidden) and words 
are Markov on a tree. In other words, latent tree models posit that the words are generated from 
a single topic, while, in reality there are common words across topics. Loopy graphical models 
are able to capture such relationships and we consider learning such models, while retaining many
advantages of the latent tree models. 

Relaxing the tree assumption leads to non-trivial challenges: in general, learning these models 
is NP-hard [9, 32], even when there are no latent variables, and developing methods for learning 
such fully observed models is itself an area of active research (e.g. [4, 31, 45]). In this paper, we 
consider structure estimation in latent graphical models Markov on locally tree-like graphs, meaning 
that local neighborhoods in the graph do not contain cycles. Learning such graphs has many
non-trivial challenges: are there parameter regimes where these models can be learnt consistently and
efficiently? If so, are there practical learning algorithms? Are learning guarantees for loopy models 
comparable to those for latent trees? How does learning depend on various graph attributes such as 
node degrees, girth of the graph, and so on? We provide answers to these questions in this paper. 

1.1. Our Approach and Contributions. We consider learning latent graphical models Markov on 
locally tree-like graphs in the regime of correlation decay. In this regime, there are no long-range 
correlations, and the local statistics converge to a tree limit. The implication of correlation decay 
is immediately clear: we can employ the available latent tree methods to learn "local" subgraphs 
consistently, as long as they do not contain any cycles. However, a non-trivial challenge remains: 
how does one merge these estimated local subgraphs (i.e., latent trees) to obtain an overall graph 
estimate? Specifically, merging involves matching latent nodes across different latent tree estimates, 
and it is not clear if this can be performed in an efficient manner. 

We employ a different philosophy for building locally tree-like graphs with latent variables. We 
decouple the process of introducing cycles and latent variables in the output model. We initialize a 
loopy graph consisting of only the observed variables, and then iteratively add latent variables to 
local neighborhoods of the graph. We establish correctness of our method under a set of natural 
conditions. 

We provide precise conditions for structural consistency of our method, termed LocalCLGrouping,
under the probably approximately correct (PAC) model of learning [33] for general discrete models.
We simplify these conditions for the Ising model, where each node is a binary random variable, to
obtain better intuition. We establish that for structural consistency, the number of samples is
required to scale as $n = \Omega(\theta_{\min}^{-\delta\eta(\eta+1)-2} \log p)$, where $p$ is the number
of observed variables, $\theta_{\min}$ is the minimum edge potential, $\delta$ is the depth (i.e., graph
distance from a hidden node to the nearest observed nodes), and $\eta$ is a parameter which depends
on the minimum and maximum node and edge potentials of the Ising model ($\eta = 1$ for homogeneous
models). When there are no hidden variables ($\delta = 1$), the sample complexity is strengthened to
$n = \Omega(\theta_{\min}^{-2} \log p)$, which matches the best known sample complexity for learning
fully-observed Ising models [4, 31].

We also establish necessary conditions for any (deterministic) algorithm to recover the graph 
structure, and establish that n = 0(AjninP^^ logp) samples are necessary for structural consistency, 
where $\Delta_{\min}$ is the minimum degree and $\rho$ is the fraction of observed nodes. This is comparable to
the requirement of the proposed method under uniform node sampling (i.e., selecting the observed 
nodes uniformly), given by n = ^{A^^^p^^ [log p)^) . Thus, our method is competitive with respect 
to the lower bound on learning. 

Our proposed method has a number of attractive features for practical implementation. The
method is amenable to parallelization, which makes it efficient on large datasets. The method
provides flexibility to control the length of cycles and the number of latent variables introduced in
the output model. The method can incorporate penalty scores such as the Bayesian information
criterion (BIC) [47] to trade off model complexity and fidelity. Moreover, by controlling the cycle
lengths in the output model, we can obtain models with good inference accuracy under simple
algorithms such as loopy belief propagation (LBP). Preliminary experiments on the newsgroup dataset
suggest that the method can discover intuitive relationships efficiently, and that it compares well
with the popular latent Dirichlet allocation (LDA) [8] in terms of topic coherence and perplexity.

1.2. Related Work. The classical latent cluster models (LCM) consist of multivariate distributions
with a single latent variable, where the observed variables are conditionally independent under
each state of the latent variable [36]. Hierarchical latent class (HLC) models [17, 52, 53] general- 
ize these models by allowing multiple latent variables. However, the proposed learning algorithms 
are based on greedy local search in a high-dimensional space, which is computationally expensive. 
Moreover, the algorithms do not have theoretical guarantees. Similar shortcomings also hold for 
expectation-maximization (EM) based approaches [24, 34]. Learning latent trees has been studied 
extensively before, mainly in the context of phylogenetics. See [23, 48] for a thorough overview. 
Efficient algorithms with provable performance guarantees are available (e.g. [2, 19, 21, 25]). Our 
proposed method in this paper is inspired by [19]. 

Works on high-dimensional graphical model selection are more recent. The approaches can be 
mainly classified into two groups: non-convex local approaches [4, 11, 31, 41] and those based on 
convex optimization [15, 37, 45, 46]. There is a general agreement that the success of these methods 
is related to the presence of correlation decay in the model [4, 6]. This work makes the connection 
explicit: it relates the extent of correlation decay (i.e., the convergence rate to the tree limit) with 
the learning efficiency for latent models on large girth graphs. An analogous study of the effect of 
correlation decay for learning fully observed models is presented in [4]. 

This paper is the first work to provide provable guarantees for learning discrete latent models
on loopy graphs in high dimensions (which can also be easily extended to Gaussian models, see
remarks following Theorem 2). The work in [16] considers learning latent Gaussian graphical models
using a convex relaxation method. However, the method cannot be easily extended to discrete 
models. Moreover, the "incoherence" conditions required for the success of convex methods are 
hard to interpret and verify in general. In contrast, our conditions for success are transparent and 
based on the presence of correlation decay in the model. The work in [11] considers graphical model
selection with hidden variables, but proposes learning the Markov graph of the marginal distribution
(upon marginalizing the hidden variables) and then replacing the cliques in the estimated graph with
hidden variables. Sample complexity results are not provided, and the method performs poorly in
high dimensions, since it aims to estimate dense graphs.

In [1], the problem of network tomography on locally tree-like graphs is considered, where the
task is to estimate the graph using end-to-end path-based measurements (e.g. delay, link loss rate). 
It is established that a decaying fraction of participants is sufficient to learn the underlying graph. 
This paper has a different model, which is more challenging since here, we do not have a simple 
additive metric along the paths in the graph. 

2. System Model. 

2.1. Graphical Models. A graphical model is a family of multivariate distributions which are
Markov in accordance with a particular undirected graph [35]. Each node in the graph $i \in W$ is
associated with a random variable $X_i$ taking values in a set $\mathcal{X}$. We consider discrete
graphical models where $\mathcal{X}$ is a finite set. The set of edges $E$ captures the set of
conditional independence relations among the random variables. We say that a set of random
variables $X_W := \{X_i, i \in W\}$ with probability mass function (pmf) $P$ is Markov on the graph
$G$ if the local Markov property

(1) $P(x_i \mid x_{\mathcal{N}(i)}) = P(x_i \mid x_{W \setminus i})$

holds for all nodes $i \in W$, where $\mathcal{N}(i)$ are the neighbors of node $i$ in graph $G$. More
generally, we say that $P$ satisfies the global Markov property, if for all disjoint sets
$A, B \subset W$, we have

(2) $P(x_A, x_B \mid x_S) = P(x_A \mid x_S) P(x_B \mid x_S)$,

where the set $S$ is a separator^1 between $A$ and $B$. The local and global Markov properties are
equivalent under the positivity condition, given by $P(x_W) > 0$ for all $x_W \in \mathcal{X}^{|W|}$ [35],
and we consider such distributions.

The Hammersley-Clifford theorem [10] states that under the positivity condition, a distribution
$P$ satisfies the Markov property according to a graph $G$ if and only if it factorizes according
to the cliques of $G$, and we can write it in the exponential form as

(3) $P(x) = \exp\left( \sum_{c \in \mathcal{C}} \theta_c(x_c) - A(\theta) \right)$,

where $\mathcal{C}$ is the set of cliques of $G$ and $x_c$ is the set of random variables on clique $c$.
The quantity $A(\theta)$ is known as the log-partition function and serves to normalize the
probability distribution. The functions $\theta_c$ are known as potential functions and correspond to
the canonical parameters of the exponential family.

A special case is the Ising model, which is the pairwise model over binary variables
$\{-1,+1\}^{|W|}$ with the probability mass function (pmf) given by

(4) $P(x_W) = \exp\left( \sum_{(i,j) \in E} \theta_{ij} x_i x_j + \sum_{i \in W} \phi_i x_i - A(\theta) \right)$.

We specialize some of our results to the class of Ising models.
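
As a concrete illustration of (4), the following minimal Python sketch enumerates the Ising pmf on a small graph by brute force. The function name `ising_pmf` and its dictionary-based inputs are assumptions made for this example and do not come from the paper; the enumeration is exponential in the number of nodes, so it is only meant for toy graphs.

```python
import itertools
import math

def ising_pmf(nodes, theta, phi):
    """Enumerate the Ising pmf in (4) on a small graph (hypothetical helper).

    nodes: list of node labels
    theta: dict mapping an edge (i, j) to its edge potential theta_ij
    phi:   dict mapping a node i to its node potential phi_i (missing means 0)
    Returns a dict from configurations (tuples over {-1, +1}) to probabilities.
    """
    index = {v: k for k, v in enumerate(nodes)}
    weights = {}
    for x in itertools.product((-1, +1), repeat=len(nodes)):
        energy = sum(t * x[index[i]] * x[index[j]] for (i, j), t in theta.items())
        energy += sum(phi.get(v, 0.0) * x[index[v]] for v in nodes)
        weights[x] = math.exp(energy)
    z = sum(weights.values())  # partition function, i.e. exp(A(theta))
    return {x: w / z for x, w in weights.items()}

# Two nodes, one edge, zero node potentials.
pmf = ising_pmf([1, 2], {(1, 2): 0.5}, {})
corr = sum(p * x[0] * x[1] for x, p in pmf.items())  # E[X_1 X_2]
```

For this two-node model with zero node potentials, the computed correlation agrees with the closed form tanh(0.5).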

We consider latent graphical models in which a subset of nodes is latent or hidden. Let
$H \subset W$ denote the hidden nodes and $V \subset W$ denote the observed nodes. Our goal is to
discover the presence of hidden variables $X_H$ and learn the unknown graph structure $G(W)$, given
$n$ i.i.d. samples from observed variables $X_V$. Let $p := |V|$ denote the number of observed nodes
and $m := |W|$ denote the total number of nodes.

2.2. Tractable Graph Families: Girth-Constrained Graphs. In general, structure estimation of 
graphical models is NP-hard [9, 32]. We now characterize a tractable class of models for which we 
can provide guarantees on graph estimation. 

We consider the family of graphs with a bound on the girth, which is the length of the shortest
cycle in the graph. Let $\mathcal{G}_{\text{Girth}}(m; g)$ denote the ensemble of graphs on $m$ nodes
with girth at least $g$. There are many graph constructions which lead to a bound on girth. For
example, the bipartite Ramanujan graph [20, p. 107] and the random Cayley graphs [28] have bounds
on the girth. Recently, efficient algorithms have been proposed to generate large girth graphs [5].
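
To check whether a given graph belongs to a girth-constrained family, the girth can be computed with one breadth-first search per node. The sketch below is a standard textbook routine, not part of the proposed method; the function name `girth` and the adjacency-dictionary input format are choices made for this example, and the graph is assumed to be simple and undirected.

```python
from collections import deque

def girth(adj):
    """Length of the shortest cycle in a simple undirected graph.

    adj: dict mapping each node to an iterable of its neighbors.
    Returns float('inf') if the graph has no cycle.
    """
    best = float("inf")
    for root in adj:
        dist = {root: 0}
        parent = {root: None}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    parent[v] = u
                    queue.append(v)
                elif parent[u] != v:
                    # A non-tree edge closes a cycle of length at most this value.
                    best = min(best, dist[u] + dist[v] + 1)
    return best

cycle5 = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
path3 = {0: [1], 1: [0, 2], 2: [1]}
```

Each candidate value is an upper bound on the girth, and the bound is tight when the search starts from a node on a shortest cycle, so the minimum over all roots is the girth.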

Although girth-constrained graphs are locally tree-like, in general, their global structure makes
them hard instances for learning. Specifically, girth-constrained graphs have a large treewidth: it
is known that a graph with average degree at least $\Delta_{\text{avg}}$ and girth at least $g$ has a
treewidth of $\Omega\left( \frac{1}{g+1} (\Delta_{\text{avg}} - 1)^{\lfloor (g-1)/2 \rfloor} \right)$ [14].
Thus, learning is non-trivial for graphical models Markov on girth-constrained graphs, even when
there are no latent variables, due to their large treewidth [32].



^1 A set $S \subset W$ is a separator for sets $A$ and $B$ if the removal of nodes in $S$ separates
$A$ and $B$ into distinct components.
2.3. Local Convergence to a Tree Limit. This work establishes tractable learning when the
graphical model converges locally to a tree limit. A sufficient condition for the existence of such
limits is the regime of correlation decay^2, which refers to the property that there are no long-range
correlations in the model [29, 38, 51]. This regime is also known as the uniqueness regime since 
under such an assumption, the marginal distribution at a node is asymptotically independent of 
the configuration of a growing boundary. 

We tailor the definition of correlation decay to node neighborhoods and provide the definition
below. Given a graph $G = (W, E)$ and a graphical model $P_{X_W|G}$ Markov on it, and any subset
$A \subset W$, let $P_{X_A|G}$ denote the marginal distribution of variables in $A$. For some subgraph
$F \subset G$, let $P_{X_A|F}$ denote the marginal distribution on $A$ corresponding to a graphical
model Markov on graph $F$ instead of $G$ (i.e., by setting the potentials of edges in $G \setminus F$
to zero). Let $\mathcal{N}[i;G] := \mathcal{N}(i;G) \cup i$ denote the closed neighborhood of node $i$
in $G$. For any two sets $A_1, A_2 \subset W$, let
$\text{dist}(A_1, A_2) := \min_{i \in A_1, j \in A_2} \text{dist}(i,j)$ denote the minimum graph
distance^3. Let $B_l(i)$ denote the set of nodes within graph distance $l$ from node $i$ and
$\partial B_l(i)$ denote the boundary nodes, i.e., those exactly at distance $l$ from node $i$. Let
$F_l(i;G) := G(B_l(i))$ denote the induced subgraph on $B_l(i)$. For any distributions $P, Q$, let
$\|P - Q\|_1$ denote the $\ell_1$ norm of their difference.

Definition 1 (Correlation Decay). A graphical model $P_{X_W|G}$ Markov on graph $G_m = (W_m, E_m)$
is said to exhibit correlation decay with a non-increasing rate function $C_m(\cdot) > 0$ if for all
$l, m \in \mathbb{N}$,

(5) $\| P_{X_A|G_m} - P_{X_A|F_l(i;G_m)} \|_1 \le C_m(\text{dist}(A, \partial B_l(i))), \quad \forall i \in W_m, A \subset B_l(i)$.

In words, the total variation distance^4 between the marginal distribution of a set $A$ of a graphical
model Markov on $G_m$ and the corresponding model Markov on subgraph $F_l(i;G_m)$ decays as a
function of the graph distance to the boundary. This implies that for a class of functions
$C_m(\cdot)$, the graph configuration beyond $l$ hops from any node $i$ has a decaying effect on the
local marginal distributions.

For the class of Ising models in (4), the regime of correlation decay can be explicitly characterized, 
in terms of the maximum edge potential and the maximum degree of the graph, and this is studied 
in Section 3.2. 
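
A simple instance of the absence of long-range correlations can be checked numerically. On a chain (a tree) with uniform edge potential and zero node potentials, the correlation between the two endpoints is known to equal the d-th power of tanh(theta), where d is the number of edges between them, so it decays geometrically with graph distance. The sketch below verifies this by enumeration; it illustrates pairwise correlation only, not the formal condition in Definition 1, and the function name `chain_correlation` is an assumption of this example.

```python
import itertools
import math

def chain_correlation(length, theta):
    """E[X_0 X_length] for a zero-field Ising chain with `length` edges, by enumeration."""
    n = length + 1
    numerator = 0.0
    z = 0.0
    for x in itertools.product((-1, +1), repeat=n):
        w = math.exp(theta * sum(x[k] * x[k + 1] for k in range(n - 1)))
        z += w
        numerator += w * x[0] * x[-1]
    return numerator / z

theta = 0.4
# Endpoint correlations for chains with 1, 2, ..., 6 edges.
correlations = [chain_correlation(d, theta) for d in range(1, 7)]
```

The list `correlations` shrinks by a factor of tanh(0.4) with every additional edge.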

3. Method and Guarantees for Structure Estimation. 

3.1. Overview of Algorithm. We now describe our algorithm, termed LocalCLGrouping, for
structure estimation of latent graphical models Markov on girth-constrained graphs. The algorithm
builds on the Chow-Liu grouping algorithm developed for latent tree models [19], described in
Appendix A.1. The main intuition for learning a girth-constrained graph is based on reconstructing
"local" parts of the graph which are acyclic and piecing them together. However, this approach
has many challenges. First, it is not clear if the local acyclic pieces can be learnt efficiently, since
this requires the presence of an additive tree metric. This is addressed by considering models
satisfying correlation decay (see Section 2.3). The second and harder challenge involves merging the
reconstructed local latent trees with provable guarantees, due to the introduction of unlabeled latent



^2 Technically, correlation decay can be defined in multiple ways [38, p. 520] and the notion we use is the uniqueness
or the weak spatial mixing condition.

^3 We distinguish between the terms graph distance and information distance. The former refers to the least number
of hops on the graph, while the latter refers to the quantity in (6).

^4 Recall that the total variation distance between two probability distributions $P, Q$ on the same alphabet is given
by $\frac{1}{2} \|P - Q\|_1$.



Algorithm 1 LocalCLGrouping($\hat{d}^n(V), \Lambda, \tau, r$) for graph estimation using distance estimates
$\hat{d}^n(V) := \{\hat{d}^n(i,j)\}_{i,j \in V}$, confidence interval $\Lambda$, threshold $\tau$ and distance parameter $r$.

Input: Distance estimates between the observed nodes $\hat{d}^n(V) := \{\hat{d}^n(i,j)\}_{i,j \in V}$, confidence
parameter $\Lambda$, threshold $\tau$ and bound $r$ on distances used for local reconstruction. Let
$B_r(v; \hat{d}^n) := \{u : \hat{d}^n(u,v) < r\}$ and let MST($A; \hat{d}^n$) denote the minimum spanning tree over
$A \subset V$ based on edge weights $\hat{d}^n(A)$. Given a graph $G$, let Leaf($G$) denote the set of nodes with
unit degree. Let $\mathcal{N}[i;G]$ denote the closed neighborhood of node $i$ in graph $G$.
RG($\hat{d}^n(A), \Lambda, \tau$) represents the recursive grouping method for building latent trees (see
Appendix A.1) over the set of nodes $A$ using distance estimates $\hat{d}^n(A)$ with confidence bound $\Lambda$
and threshold $\tau$ for merging nodes.

for $v \in V$ do
    $T_v \leftarrow$ MST($B_r(v); \hat{d}^n$).
end for
Initialize $\hat{G}, \hat{G}_0 \leftarrow \cup_v T_v$.
for $v \in V \setminus$ Leaf($\hat{G}_0$) do
    $A \leftarrow \mathcal{N}[v; \hat{G}]$.
    $S \leftarrow$ RG($\hat{d}^n(A), \Lambda, \tau$).
    $\hat{G}(A) \leftarrow S$ (Replace subgraph over $A$ with $S$ in $\hat{G}$)
end for
Output $\hat{G}$.

nodes in different pieces. We circumvent this challenge by leveraging the Chow-Liu grouping
algorithm [19] and merging the different pieces before introducing the latent nodes.

The algorithm is described in Algorithm 1. Let $\hat{d}^n(i,j)$ denote the estimated distance between
nodes $i$ and $j$ according to (43) using the empirical distribution $\hat{P}^n_{i,j}$, computed using
$n$ samples, i.e.,

(6) $\hat{d}^n(i,j) := -\log |\det(\hat{P}^n_{i,j})|, \quad \forall i,j \in V$.

The set of distance estimates $\hat{d}^n(V) := \{\hat{d}^n(i,j) : i,j \in V\}$ is input to the algorithm
along with a parameter $r$. Recall that $B_r(i; \hat{d}^n(V)) := \{j : \hat{d}^n(i,j) < r\}$. For each
observed node $i \in V$, the set of nodes $B_r(i; \hat{d}^n(V))$ is considered, and the minimum spanning
tree over it is constructed. The graph estimate $\hat{G}^n$ is initialized by taking the union of all the
local minimum spanning trees. The latent nodes are now iteratively added by considering local
neighborhoods of $\hat{G}^n$ and using any latent tree algorithm for reconstruction (e.g. [19, 39]). Note
that the running time is polynomial (in the number of nodes) as long as polynomial time algorithms
are employed for local latent tree reconstruction.
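
The distance estimate in (6) can be sketched in a few lines of Python. The helper names below (`empirical_joint`, `det`, `information_distance`) are hypothetical, and the code follows (6) as printed here, namely the negative log absolute determinant of the empirical joint pmf matrix; any further normalization used in (43), which is not reproduced in this section, is not included.

```python
import math
from collections import Counter

def empirical_joint(samples_i, samples_j, alphabet):
    """Empirical joint pmf matrix of two variables over a common alphabet."""
    n = len(samples_i)
    counts = Counter(zip(samples_i, samples_j))
    return [[counts[(a, b)] / n for b in alphabet] for a in alphabet]

def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    size, result = len(m), 1.0
    for c in range(size):
        pivot = max(range(c, size), key=lambda r: abs(m[r][c]))
        if m[pivot][c] == 0.0:
            return 0.0
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            result = -result  # a row swap flips the sign
        result *= m[c][c]
        for r in range(c + 1, size):
            f = m[r][c] / m[c][c]
            for k in range(c, size):
                m[r][k] -= f * m[c][k]
    return result

def information_distance(samples_i, samples_j, alphabet=(-1, +1)):
    """Estimate -log|det P_ij| as in (6); returns inf when the matrix is singular."""
    d = abs(det(empirical_joint(samples_i, samples_j, alphabet)))
    return math.inf if d == 0.0 else -math.log(d)

xi = [-1, -1, -1, -1, 1, 1, 1, 1]
xj = [-1, -1, -1, 1, -1, 1, 1, 1]
d_hat = information_distance(xi, xj)
```

Empirically independent variables give a singular joint matrix and hence an infinite distance, which matches the intuition that weakly related nodes are far apart.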

The proposed method is efficient for practical implementation due to the "divide and conquer" 
feature, i.e., the local latent tree building operations can be parallelized to obtain speedups. For 
real datasets, a tradeoff between model complexity and fidelity is typically enforced by optimizing 
scores such as the Bayesian information criterion (BIC) [47]. Such criteria can be easily enforced 
through a greedy local search in each iteration of our method, and this limits the number of hidden 
variables added by our method. In our experiments in Section 5, we found that this method is fast 
to implement on real and synthetic datasets. 

We subsequently establish the correctness of the proposed method under a set of natural
conditions. We require that the parameter $r$, which determines the set $B_r(i; \hat{d})$ for each node
$i$, be chosen as a function of the depth $\delta$ (i.e., distance from a hidden node to its closest
observed nodes) and girth $g$ of the graph. In practice, the parameter $r$ provides flexibility in
tuning the length of cycles added to the graph estimate. When $r$ is large enough, we obtain a latent
tree, while for small $r$, the graph estimate can contain many short cycles (and potentially many
components). In experiments, we evaluate the performance of our method for different values of $r$.
For more details, see Section 5.
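
The first phase of the method, building a minimum spanning tree on the neighborhood of each observed node and taking the union, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `local_mst_union`, the `frozenset`-keyed distance dictionary, and the use of Prim's algorithm are choices made for this example, and the later latent-node step (recursive grouping) is not shown.

```python
def local_mst_union(nodes, dist, r):
    """Union of minimum spanning trees built on the r-neighborhood of each node.

    nodes: list of observed node labels
    dist:  dict mapping frozenset({u, v}) to the distance estimate between u and v
    r:     neighborhood radius; the ball of v holds v and every u with dist(u, v) < r
    Returns a set of undirected edges, each a frozenset({u, v}).
    """
    def d(u, v):
        return dist[frozenset((u, v))]

    edges = set()
    for v in nodes:
        ball = [v] + [u for u in nodes if u != v and d(u, v) < r]
        # Prim's algorithm on the complete weighted graph over the ball.
        in_tree = {v}
        while len(in_tree) < len(ball):
            a, b = min(
                ((a, b) for a in in_tree for b in ball if b not in in_tree),
                key=lambda e: d(*e),
            )
            in_tree.add(b)
            edges.add(frozenset((a, b)))
    return edges

# Four nodes placed on a line; distances are absolute differences of positions.
pos = {0: 0.0, 1: 1.0, 2: 2.0, 3: 3.0}
dist = {frozenset((u, v)): abs(pos[u] - pos[v]) for u in pos for v in pos if u != v}
merged = local_mst_union(list(pos), dist, r=1.5)
```

In this toy input the merged graph is the path 0-1-2-3. A very small radius leaves every ball with a single node and produces no edges, in line with the remark that small values of the parameter can yield many components.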
[Figure 1 panels: (a) Loopy latent graphical model; (b) Local neighborhoods are tree-like; (c) Local MST;
(d) Merging of local MSTs; (e) Local neighborhood on merged graph; (f) Latent tree on local neighborhood.]

Fig 1. Various steps of the LocalCLGrouping method on a simple cycle, where observed variables are shaded. See
Section 3.1.1.



3.1.1. Toy Example with a Single Cycle. To demonstrate the steps of the proposed
method, consider the simple case of a single cycle of length 5, where all the nodes on the cycle
are hidden and each hidden node has two observed leaves, as shown in Fig. 1a. When the cycle
length $g$ is sufficiently large, information distances on local neighborhoods are approximately
additive, as depicted in Fig. 1b. Moreover, in Fig. 1b, let "*" denote the observed node closest
to each hidden node (termed its surrogate), in terms of information distance. The minimum
spanning tree over the set of four nodes, which are zoomed in, corresponds to a chain
shown in Fig. 1c. Similarly, if in different local neighborhoods of observed nodes (based on a
threshold on information distances), the surrogate relationships are similar (i.e., every hidden
node has one of its children as its surrogate), then the local MSTs are simple chains,
and their merging gives rise to graph $\hat{G}$ in Fig. 1d. Now if a local neighborhood is selected on
the merged graph $\hat{G}$, as shown in Fig. 1e, then we can discover the local latent tree structure
based on information distances as shown in Fig. 1f, since they are approximately additive.
Similarly, when different neighborhoods on $\hat{G}$ are selected, local latent trees are discovered,
and we recover the latent cycle graph in Fig. 1a in the end.

3.2. Results for Ising Models. We first limit ourselves to providing asymptotic guarantees
for the Ising model in (7), and then extend the results to non-asymptotic guarantees in
general discrete distributions. Recall that the Ising model is a pairwise model over binary
variables $\{-1,+1\}^{|W|}$ with the probability mass function (pmf) given by

(7) $P(x_W) = \exp\left( \sum_{(i,j) \in E} \theta_{ij} x_i x_j + \sum_{i \in W} \phi_i x_i - A(\theta) \right)$.



3.2.1. Conditions for Recovery in Ising Models. We present a set of natural conditions on 
the graph structure and model parameters under which our proposed method succeeds in 
structure estimation. 

(A1) Minimum Degree of Latent Nodes: We require that all latent nodes have degree
at least three, which is a natural assumption for identifiability of hidden variables.
Otherwise, the latent nodes can be marginalized to obtain an equivalent representation
of the observed statistics.
(A2) Distance Bounds: Assume bounds on the edge potentials $\theta := \{\theta_{ij}\}$ of the Ising
model:

(8) $\theta_{\min} \le |\theta_{ij}| \le \theta_{\max}, \quad \forall (i,j) \in G$.

Similarly assume bounded node potentials. We now define certain quantities which depend
on the edge potential bounds. Given an Ising model $P$ with edge potentials
$\theta = \{\theta_{ij}\}$ and node potentials $\phi = \{\phi_i\}$, consider its attractive counterpart
$\bar{P}$ with edge potentials $\bar{\theta} := \{|\theta_{ij}|\}$ and node potentials
$\bar{\phi} := \{|\phi_i|\}$. Let $\phi'_{\max} := \max_{i \in V} \text{atanh}(\bar{E}(X_i))$,
where $\bar{E}$ is the expectation with respect to the distribution $\bar{P}$. Let
$P(X_{1,2}; \{\theta, \phi_1, \phi_2\})$ denote an Ising model on two nodes $\{1,2\}$ with edge
potential $\theta$ and node potentials $\{\phi_1, \phi_2\}$. Our learning guarantees depend on
$d_{\min}$ and $d_{\max}$ satisfying

(9) $d_{\min} \ge -\log |\det P(X_{1,2}; \{\theta_{\max}, \phi'_{\max}, \phi'_{\max}\})|$,

(10) $d_{\max} \le -\log |\det P(X_{1,2}; \{\theta_{\min}, 0, 0\})|$,

(11) $\eta := \frac{d_{\max}}{d_{\min}}$.

(A3) Correlation Decay: We assume correlation decay in the Ising model and require that

(12) $\alpha := \Delta_{\max} \tanh \theta_{\max} < 1, \quad \frac{\alpha^{g/2}}{\theta_{\min}^{\eta(\eta+1)+2}} = o(1)$,

where $\Delta_{\max}$ is the maximum node degree, $g$ is the girth and $\theta_{\min}, \theta_{\max}$ are the
minimum and maximum (absolute) edge potentials in the model.
(A4) Girth vs. Depth: The depth $\delta$ characterizes how close the latent nodes are to observed
nodes on graph $G$: for each hidden node $h \in H$, find a set of four observed nodes which
form the shortest quartet with $h$ as one of the middle nodes, and consider the largest
graph distance in that quartet. The depth $\delta$ is the worst-case distance over all hidden
nodes. We require the following tradeoff between the girth $g$ and the depth $\delta$:

(13) $\frac{g}{4} - \delta\eta(\eta+1) = \omega(1)$.

Further, the parameter $r$ in our algorithm is chosen as

(14) $r > \delta(\eta+1) d_{\max} + \epsilon$ for some $\epsilon > 0$, $\quad \frac{g}{4} d_{\min} - r = \omega(1)$.

(A1) is a natural assumption on the minimum degree of the hidden nodes for identifiability.
(A2) relates certain distance bounds to bounds on edge potentials. Intuitively, $d_{\min}$ and
$d_{\max}$ are bounds on information distances given by the local tree approximation of the loopy
model, and their precise definition is given in (18). Note that $e^{-d_{\max}} = \Omega(\theta_{\min})$
and $e^{-d_{\min}} = O(\theta_{\max})$.
(A3) uses bounds on the edge potentials to impose correlation decay on the model. It is
natural that the sample requirement of any graph estimation algorithm depends on the
"weakest" edge characterized by the minimum edge potential $\theta_{\min}$. Further, the maximum
edge potential $\theta_{\max}$ characterizes the presence or absence of long-range correlations in the
model. Intuitively, there is a tradeoff between the maximum degree $\Delta_{\max}$ and the maximum
edge potential $\theta_{\max}$ of the model. Moreover, (A3) prescribes that the extent of correlation
decay be strong enough (i.e., a small $\alpha$ and a large enough girth $g$) compared to the weakest
edge in the model. Similar conditions have been imposed before for graphical model selection
in the regime of correlation decay when there are no hidden variables [4]. (A4) provides the
tradeoff between the girth $g$ and the depth $\delta$. Intuitively, the depth needs to be smaller than
the girth to avoid encountering cycles during the process of graph reconstruction. Recall that
the parameter $r$ in our algorithm determines the neighborhood over which local MSTs are
built in the first step. It is chosen such that it is roughly larger than the depth $\delta$ in order
for all the hidden nodes to be discovered. The upper bound on $r$ ensures that the distortion
from an additive metric is not too large. The parameters for latent tree learning routines
(such as confidence intervals for quartet tests) are chosen appropriately depending on $d_{\min}$
and $d_{\max}$. See Section 3.3.
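
The quantities entering these conditions are easy to evaluate for a two-node Ising model. The sketch below computes the right-hand sides of (9) and (10), their ratio, and the product of the maximum degree with tanh of the maximum edge potential from (12). It assumes, for illustration only, that the two distance bounds are taken equal to those right-hand sides; the function names and the argument `phi_max` (standing for the atanh-of-mean quantity in (A2)) are hypothetical.

```python
import math

def two_node_ising(theta, phi1=0.0, phi2=0.0):
    """2x2 joint pmf matrix of a two-node Ising model, rows and columns ordered (-1, +1)."""
    states = (-1, +1)
    w = [[math.exp(theta * a * b + phi1 * a + phi2 * b) for b in states] for a in states]
    z = sum(map(sum, w))
    return [[v / z for v in row] for row in w]

def pair_distance(theta, phi1=0.0, phi2=0.0):
    """-log|det P| for the two-node model, the quantity appearing in (9) and (10)."""
    p = two_node_ising(theta, phi1, phi2)
    return -math.log(abs(p[0][0] * p[1][1] - p[0][1] * p[1][0]))

def ising_quantities(theta_min, theta_max, phi_max, max_degree):
    """Right-hand sides of (9) and (10), their ratio, and alpha from (12)."""
    d_min = pair_distance(theta_max, phi_max, phi_max)  # assumption: bound taken with equality
    d_max = pair_distance(theta_min)                    # assumption: bound taken with equality
    return {
        "d_min": d_min,
        "d_max": d_max,
        "eta": d_max / d_min,
        "alpha": max_degree * math.tanh(theta_max),
    }

q = ising_quantities(theta_min=0.2, theta_max=0.2, phi_max=0.0, max_degree=3)
```

With equal edge potential bounds and zero node potentials the ratio equals one, in line with the statement that the ratio is one for homogeneous models. For zero node potentials the determinant of the two-node joint matrix is tanh(theta)/4, so the distance is log 4 minus log tanh(theta).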

3.2.2. Guarantees for Ising Models. We now establish that the proposed method correctly
estimates the graph structure of an Ising model in high dimensions. Recall that $\delta$ is the depth
(distance from a hidden node to its closest observed nodes), $\theta_{\min}$ is the minimum (absolute)
edge potential and $\eta = d_{\max}/d_{\min}$ is the ratio of distance bounds.

Theorem 1 (Structural Consistency for Ising Models). Under (A1)-(A4), the probability
that the proposed method is structurally consistent tends to one, when the number of samples
scales as

(15) $n = \Omega(\theta_{\min}^{-\delta\eta(\eta+1)-2} \log p)$.

Proof: See Appendix B. □

Remarks:

1. Thus, for learning Ising models on locally tree-like graphs, the sample complexity is
dependent both on the minimum edge potential $\theta_{\min}$ and on the depth $\delta$. Our method
is efficient in high dimensions since the sample requirement is only logarithmic in the
number of nodes $p$.

2. Dependence on Maximum Degree: For the correlation decay to hold (A3), we
require $\theta_{\min} \le \theta_{\max} = \Theta(1/\Delta_{\max})$. This implies that the sample complexity is at least
$n = \Omega(\Delta_{\max}^{\delta\eta(\eta+1)+2} \log p)$.

3. Comparison with Fully Observed Models: In the special case when all the nodes
are observed^5 ($\delta = 1$), we strengthen the results for our method and establish that the
sample complexity is $n = \Omega(\theta_{\min}^{-2} \log p)$. This matches the best known sample complexity
for learning fully observed Ising models [4, 31].

4. Comparison with Learning Latent Trees: Our method is an extension of latent
tree methods for learning locally tree-like graphs. The sample complexity of our method
matches the sample requirements for learning general latent tree models [19, 25, 39].
Thus, we establish that learning locally tree-like graphs is akin to learning latent trees
in the regime of correlation decay.

^5 In the trivial case, when all the nodes are observed and the graph is locally tree-like, our method reduces
to thresholding of information distances at each node, and building local MSTs. The threshold can be chosen as
$r = d_{\max} + \epsilon$, for some $\epsilon > 0$.
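
To get a feel for how the rate in Theorem 1 grows with the depth, the following sketch evaluates the leading term with all constants dropped. It assumes the exponent on the minimum edge potential is delta * eta * (eta + 1) + 2, as in the sample complexity stated for the Ising model; the function name `sample_scaling` is hypothetical, and the returned number is an order-of-magnitude guide, not a sample size.

```python
import math

def sample_scaling(theta_min, delta, eta, p):
    """theta_min^(-(delta*eta*(eta+1) + 2)) * log(p): the rate without constants."""
    exponent = delta * eta * (eta + 1) + 2  # assumed form of the exponent
    return theta_min ** (-exponent) * math.log(p)

# Homogeneous model (eta = 1), minimum edge potential 0.5, 1000 observed nodes.
rates = {delta: sample_scaling(0.5, delta, 1.0, 1000) for delta in (1, 2, 3)}
```

With eta equal to one, each additional unit of depth multiplies the rate by the inverse square of the minimum edge potential, a factor of 4 in this example.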

3.3. Extension to General Discrete Models. We now extend the results to general discrete 
models and provide non-asymptotic sample requirement guarantees for success of our pro- 
posed method. 

Local Tree Approximation: We first define the notion of a local tree metric $d_{\text{tree}}(V)$, computed
by limiting the model to acyclic neighborhood subgraphs between the respective node pairs.
Given a graph $G = (W, E)$, let $\text{tree}(i,j;G) := G(B_l(i) \cup B_l(j))$, for $l = \lfloor g/2 \rfloor - 1$,
denote the induced subgraph on $B_l(i) \cup B_l(j)$, where $g$ is the girth of the graph. Recall that
$B_l(i;G)$ denotes the set of nodes within graph distance $l$ from $i$ in $G$. When $l \le g/2 - 1$ no
cycles are encountered and thus the induced subgraph $\text{tree}(i,j;G)$ is acyclic. Recall that
$P_{X_{i,j}|G}$ denotes the pairwise marginal distribution between $i$ and $j$ induced by the graphical
model $P(x_W)$ on graph $G(W)$. Let $P_{X_{i,j}|\text{tree}(i,j)}$ denote the pairwise marginal distribution
between $i$ and $j$ induced by considering only the subgraph $\text{tree}(i,j;G) \subset G$. Denote

(16) $d(i,j;\text{tree}) := -\log |\det P_{X_{i,j}|\text{tree}(i,j)}|$,

(17) $d(i,j;G) := -\log |\det P_{X_{i,j}|G}|$.

Denote $d_{\text{tree}}(V) := \{d(i,j;\text{tree}) : i,j \in V\}$ and $d(V) := \{d(i,j;G) : i,j \in V\}$. Note
that for loopy graphs in general, $d(i,j;G)$ is different from $d(i,j;\text{tree})$. The learner has
access only to the empirical versions $\hat{d}^n(V)$ of the distances $d(V)$, and thus the learner
cannot estimate $d_{\text{tree}}(V)$. However, we use $d_{\text{tree}}(V)$ to characterize the performance of
our algorithm. We list the relevant assumptions below.

3.3.1. Conditions on the Model Parameters. 

(B1) Minimum Degree: The minimum degree of any hidden node in the graph is $\mathrm{Deg}_{\min}(H) \geq 3$.

(B2) Bounds on Local Tree Metric: Given a graphical model $P_{X_W|G}$ Markov on graph
$G$, the pairwise marginal distributions $P_{X_{i,j}|\mathrm{tree}(i,j)}$ between any two neighbors $(i,j) \in G$
are non-singular and the distances $d(i,j;\mathrm{tree}) := -\log|\det P_{X_{i,j}|\mathrm{tree}(i,j)}|$ satisfy

(18)    $0 < d_{\min} \leq d(i,j;\mathrm{tree}) \leq d_{\max} < \infty, \quad \forall (i,j) \in G(W), \qquad \eta := \frac{d_{\max}}{d_{\min}}$,

for suitable parameters $d_{\min}$ and $d_{\max}$. We explicitly characterize $d_{\min}$ and $d_{\max}$ for Ising
models.



Note that $P_{X_{i,j}|\mathrm{tree}(i,j)}$ for $(i,j) \in G(W)$ is the probability distribution obtained by retaining only the node potentials
$\phi_i$ and $\phi_j$ and the edge potential $\theta_{ij}$, and removing the rest of the nodes. The distance is given by
$d(i,j;\mathrm{tree}) := -\log|\det P_{X_{i,j}|\mathrm{tree}(i,j)}|$.
(B3) Regime of Correlation Decay: The pairwise statistics of the graphical model converge
locally to a tree limit according to Definition 1 with function $\zeta(\cdot)$ in (5) satisfying

(19) 0<cf?-3^-ll< "" 



2d- / lAfP 

^i "-mm / \'^\ 

where $g$ is the girth, $r$ is the distance bound parameter in LocalCLGrouping, $|\mathcal{X}|$ is the
dimension of each variable, $d_{\min}$, $d_{\max}$ are the distance bounds in (18) and

(20) V := min (d^^, 0.5e-'^(e'^-'" - 1), e~°"''"^''^^+'\ ^d^^ - r,r - d^^J{7] + 1)) . 

(B4) Confidence Interval for Quartet Test: The confidence interval in Quartet(d,A)
routine in Algorithm 2 is chosen as 

dm.., r ^2) 



(21) A = exp 



'dr, 



(B5) Threshold for Merging Nodes: The threshold r in the RG(d, A, r) routine in Algorithm 3
is chosen as

(22) r = ^-\X\'ul-l)>0, 

where $|\mathcal{X}|$ is the dimension of the variable at each node and $\zeta(\cdot)$ is the correlation
decay function according to (5).

(B1) is a natural assumption on the minimum degree of the hidden nodes for identifiability,
which is also needed for latent trees. The assumption (B2) states that every edge has bounded
distances under local tree approximations. Recall that in the special case of Ising models,
this can be expressed via bounds on edge potentials. The assumption (B3) on correlation
decay imposes a constraint on the rate function $\zeta(\cdot)$, in terms of the girth of the graph $g$, the
distance threshold $r$ used by the proposed method, the distance bounds $d_{\min}$ and $d_{\max}$, and the
depth $\delta$. Recall that the depth $\delta$ characterizes how close the latent nodes are to observed
nodes on graph $G$: for each hidden node $h \in H$, find a set of four observed nodes which
form the shortest quartet with $h$ as one of the middle nodes, and consider the largest graph
distance in that quartet. The depth $\delta$ is the worst-case distance over all hidden nodes. (B3)
implies that the depth $\delta$ must satisfy

(23) -rfmin > (5 (?7 + 1) t/max- 

Similarly, (B3) imposes constraints on the parameter $r$ used by the proposed algorithm for
building local minimum spanning trees in the first step. (B3) implies that $r$ needs to be
chosen as

(24) 6{r] + l) <,ax <r< |Cin - r. 

Intuitively, the above constraint implies that $r$ is relatively small compared to the girth of the
graph and large enough for every hidden node to be discovered. This enables the proposed
algorithm to correctly reconstruct latent trees locally.

The confidence interval constraint in (B4) is based on the concentration bounds for the
empirical distances. The threshold for merging nodes in (B5) ensures that spurious hidden
nodes are not added. These conditions are inherited from latent tree algorithms.




3.4. Guarantees for the Proposed Method. We now establish that the LocalCLGrouping algo- 
rithm is structurally consistent under the above conditions. 

Theorem 2 (Structural Consistency of LocalCLGrouping). Under assumptions (B1)-(B5),
the LocalCLGrouping algorithm is structurally consistent with probability at least $1 - \kappa$,
for any $\kappa > 0$, when the number of samples $n$ available for learning satisfies

•''' " > (.„ - \a:kS- ^ - D? (-'■°i^P+ l-y|'°g2 -logf ) . 

^ "min 

where $\nu$ is given by (20).

Remarks: 

1. Thus, we provide PAC guarantees for reconstructing latent graphical models on girth-constrained
graphs. The conditions required for success are mild and transparent, and
along the lines of the conditions required for learning latent tree models. The conditions
imposed on the girth of the graph are relatively mild. We require that the girth
be roughly larger than the depth and that the correlation decay function $\zeta(\cdot)$ be sufficiently
strong (B3). Thus, learning girth-constrained graphs is akin to learning latent
tree models (in terms of sample and computational complexities) under a wide range
of conditions.

2. One notable additional condition required for learning girth-constrained graphs, in contrast
to latent trees, is the requirement of correlation decay (B3). However, we note that
this is only a sufficient condition, and not necessary for learnability. For instance, the
result in [22] establishes that the pairwise statistics converge locally to a tree limit for
all attractive Ising models with strictly positive node potentials, but without any additional
constraints on the parameters. Our results and analysis hold in such scenarios
since we only require local convergence to a tree metric.

3. The results above are applicable for discrete models but can be extended to Gaussian
models using the notion of walk-summability in place of correlation decay according to
(5) (see [3]) and the negative logarithm of the correlation coefficient as the distance
metric (see [19]). The results can also be extended to more general linear models such
as multivariate Gaussian models, Gaussian mixtures and so on, along the lines of [2].

Proof: The detailed proof is given in Appendix C. It consists of the following main steps: 

1. We first prove correctness of LocalCLGrouping under the tree limit (i.e., distances
$d_{\mathrm{tree}}(V) := \{d(i,j;\mathrm{tree})\}_{i,j \in V}$) and then show sample-based consistency. The latter is
based on concentration bounds, along the lines of analysis for latent tree models [25, 39],
with an additional distortion introduced due to the presence of a loopy graph.

2. We now briefly describe the proof establishing the correctness of the LocalCLGrouping
algorithm under $d_{\mathrm{tree}}$ in girth-constrained graphs. Intuitively, the distances $d(i,j;\mathrm{tree})$
correspond to a tree metric when the graph distance $\mathrm{dist}(i,j) < g/2 - 1$, where $g$ is
the girth. Since LocalCLGrouping infers latent trees only locally, it avoids running into
cycles and thus correctly reconstructs the local latent trees. The initialization step in
LocalCLGrouping corresponds to the correct merge of these local latent trees under the
assumptions on parameter $r$ in (24), and the correctness of LocalCLGrouping is established.

□

3.4.1. Guarantees under Uniform Sampling. We have so far given guarantees for graph
reconstruction, given an arbitrary set of observed nodes in the graph. We now specialize the
results to the case when there is a uniform sampling of nodes and provide learning guarantees.
This analysis provides intuition on the relationship between the fraction of sampled
nodes and the resulting learning performance.

Let $\mathcal{G}_{\mathrm{Girth}}(m; g, \Delta_{\min}, \Delta_{\max})$ denote the ensemble of graphs on $m$ nodes with girth at least
$g$, minimum degree $\Delta_{\min} \geq 3$ and maximum degree $\Delta_{\max}$. Let $\rho := p/m$ denote the uniform
sampling probability for selecting observed nodes. We have the following result on the depth
$\delta$. Define a constant $\epsilon_0 > 0$ as

,^^, log(4mAmax(l-p)(^-'"'^^^'') 

[Zb) eo — . 

logm 

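Lemma 1 (Depth Under Uniform Sampling). Given uniform sampling probability $\rho$,
for any $\epsilon < \max(0, \epsilon_0)$,

(27)    $\delta < \frac{1}{\log(\Delta_{\min} - 1)} \log\left[\frac{\log(4 m^{1+\epsilon} \Delta_{\max})}{|\log(1-\rho)|}\right]$,    with probability at least $1 - m^{-\epsilon}$.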



Proof: The proof is by straightforward arguments on binomial random variables and the 
union bound. See Appendix C.4. □ 

Remarks: 

1. Assuming that the girth satisfies $g > 2\delta(1 + d_{\max}/d_{\min})$ w.h.p., when the sampling
probability and the degrees are both constant, then

(28)    $\rho = \Theta(1),\ \Delta_{\min}, \Delta_{\max} = O(1) \ \Rightarrow\ \delta = O(\log\log m) \ \Rightarrow\ n = \Omega(\mathrm{poly}(\log m))$, w.h.p.

On the other hand, with vanishing sampling probability, for $\beta \in [0, 1)$, we have

(29)    $\rho = \Theta(m^{\beta-1}),\ \Delta_{\min}, \Delta_{\max} = O(1) \ \Rightarrow\ \delta = O(\log m) \ \Rightarrow\ n = \Omega(\mathrm{poly}(m))$, w.h.p.

2. Recall that for Ising models, the best-case sample complexity of LocalCLGrouping for
structural consistency (when $\eta = 1$ and $\theta_{\min} = \theta_{\max} = \Theta(1/\Delta_{\max})$) scales as

$n = \Omega(\Delta_{\max}^{2(\delta+1)} \log p)$.

Thus, under uniform sampling, the sample complexity required for consistency scales 
as 



n 






For the special case when the graph is regular ($\Delta_{\min} = \Delta_{\max}$), this reduces to
(30) n = f] (AL.P"'(logp)^) . 
4. Necessary Conditions for Graph Estimation. We have so far provided sufficient
conditions for recovering latent graphical models Markov on girth-constrained graphs. We
now provide necessary conditions on the number of samples required by any algorithm to
reconstruct the graph. Let $\hat{G}_n : (\mathcal{X}^{|V|})^n \to \mathcal{G}_m$ denote any deterministic graph estimator
using $n$ i.i.d. samples from node set $V$, where $\mathcal{G}_m$ is the set of all possible graphs on $m$ nodes.
We first define the notion of the graph edit distance.

Definition 2 (Edit Distance). Let $G$, $\hat{G}$ be two graphs with adjacency matrices $A_G$, $A_{\hat{G}}$,
and let $V$ be the set of labeled vertices in both the graphs (with identical labels). Then the
edit distance between $G$, $\hat{G}$ is defined as

$\mathrm{dist}(\hat{G}, G; V) := \min_{\pi} \|A_{\hat{G}} - \pi(A_G)\|_1$,

where $\pi$ is any permutation on the unlabeled nodes while keeping the labeled nodes fixed.

In other words, the edit distance is the minimum number of entries that are different in
$A_{\hat{G}}$ and in any permutation of $A_G$ over the unlabeled nodes. In our context, the labeled
nodes correspond to the observed nodes $V$ while the unlabeled nodes correspond to latent
nodes $H$. We now provide necessary conditions for graph reconstruction up to certain edit
distance.
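
As an illustration of Definition 2, the following is a minimal brute-force sketch of the edit distance for small graphs. It assumes both graphs have already been padded to the same number of nodes and are given as 0/1 adjacency matrices; the function names `adjacency` and `edit_distance` are hypothetical. The search over all permutations of the unlabeled nodes is exponential and is meant only to make the definition concrete, not to be an efficient graph-matching routine.

```python
from itertools import permutations


def adjacency(num_nodes, edges):
    """Symmetric 0/1 adjacency matrix of an undirected graph."""
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        adj[u][v] = 1
        adj[v][u] = 1
    return adj


def edit_distance(adj_hat, adj, labeled):
    """Minimum entrywise L1 difference over relabelings of the unlabeled nodes.

    `labeled` is the set of node indices kept fixed (the observed nodes).
    """
    n = len(adj)
    unlabeled = [v for v in range(n) if v not in labeled]
    best = None
    for perm in permutations(unlabeled):
        relabel = list(range(n))
        for src, dst in zip(unlabeled, perm):
            relabel[src] = dst
        diff = sum(abs(adj_hat[i][j] - adj[relabel[i]][relabel[j]])
                   for i in range(n) for j in range(n))
        if best is None or diff < best:
            best = diff
    return best


# Nodes 0 and 1 are observed; nodes 2 and 3 are hidden.
g = adjacency(4, [(0, 2), (1, 3)])
g_hat = adjacency(4, [(0, 3), (1, 2)])
print(edit_distance(g_hat, g, labeled={0, 1}))  # 0: equal up to swapping hidden nodes
```

Because the adjacency matrices are symmetric, each missing or extra undirected edge contributes two differing entries in this entrywise count.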

Theorem 3 (Necessary Condition). For any deterministic estimator $\hat{G}_m : (\mathcal{X}^{m^\beta})^n \to \mathcal{G}_m$
based on $n$ i.i.d. samples from $m^\beta$ observed nodes, $\beta \in [0,1]$, of a latent graphical model Markov
on graph $G_m \in \mathcal{G}_{\mathrm{Girth}}(m; g, \Delta_{\min}, \Delta_{\max})$ on $m$ nodes with girth $g$, minimum degree $\Delta_{\min}$ and
maximum degree $\Delta_{\max}$, for all $\epsilon > 0$, we have

(31) P[dist(G„, Gm, V)>em]>l- ^J,l ^^ , ^ , 

under any sampling process used to choose the observed nodes. 

Proof: The proof is based on counting arguments. See Section D for details. □ 

Remarks: 

1. The above result states that roughly

(32)    $n = \Omega(\Delta_{\min} m^{1-\beta} \log m) = \Omega\left(\frac{\Delta_{\min}}{\rho} \log p\right)$

samples are required for structural consistency. Thus, when $\beta = 1$ (constant fraction
of observed nodes), a logarithmic number of samples is necessary, while when $\beta < 1$
(vanishing fraction of observed nodes), a polynomial number of samples is necessary
for reconstruction. From (30), recall that for Ising models, under uniform sampling of
observed nodes, the best-case sample complexity of LocalCLGrouping (for homogeneous
models on regular graphs with degree $\Delta$ and $\theta_{\min} = \theta_{\max} = \Theta(1/\Delta)$) scales as

n = Q{A^p-\logpf), 

and thus, nearly matches the lower bound on sample complexity in (32). 



We consider inexact graph matching where the unlabeled nodes can be unmatched. This is done by adding the required
number of isolated unlabeled nodes in the other graph, and considering the modified adjacency matrices [13].
5. Experiments. In this section we present experimental results on real and synthetic
data. We evaluate performance in terms of perplexity, predictive perplexity and topic coherence,
used frequently in topic modeling. In addition, we also study the tradeoff between model
complexity and data fitting through the Bayesian information criterion (BIC) [47]. Experiments
are conducted using the 20 newsgroup data set, monthly stock returns from the S&P
100 companies and synthetic data. The datasets, software code and results are available at
http://newport.eecs.uci.edu/anandkumar.

5.1. Experimental Setup. 

Synthetic data: We also generate samples from a known latent graphical model, shown
previously in Fig. 1, with a fixed depth $\delta = 1$, a fixed latent node degree $\Delta = 4$, and different
girths $g = 10, 20, 30, \ldots, 100$. The node potentials are kept at zero, while the edge potentials
are chosen randomly in the range [0.05, 0.2]. This ensures that the model remains in the
regime of correlation decay since the critical potential $\theta^* = \mathrm{atanh}(\Delta^{-1}) = 0.2554 > 0.2$.
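
The critical potential quoted above can be checked numerically with a short sketch. It assumes edge potentials are drawn uniformly from the stated range, which is one reading of "chosen randomly"; the function names and the number of edges are hypothetical and used only for illustration.

```python
import math
import random


def critical_potential(max_degree):
    """Critical edge potential theta* = atanh(1 / Delta) quoted in the text."""
    return math.atanh(1.0 / max_degree)


def sample_edge_potentials(num_edges, low=0.05, high=0.2, seed=0):
    """Draw edge potentials uniformly from [low, high] (an assumed distribution)."""
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(num_edges)]


def in_correlation_decay(edge_potentials, max_degree):
    """True when every edge potential is below the critical potential."""
    return max(edge_potentials) < critical_potential(max_degree)


theta_star = critical_potential(4)
potentials = sample_edge_potentials(num_edges=50)
print(round(theta_star, 4))                            # 0.2554
print(in_correlation_decay(potentials, max_degree=4))  # True
```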

Newsgroup data: We employ latent graphical models for topic modeling, i.e., modeling the
relationships between various words co-occurring in documents. Each hidden variable in the
model can be thought of as representing a topic, and topics and words in a document are
drawn jointly from the graphical model. For a latent tree graphical model, topics and words
are constrained to form a tree, while loopy models relax this assumption. We consider 16,242
binary samples of 100 keywords selected from the 20 newsgroup data. Each binary sample
indicates the appearance of the given words in each posting. These samples are divided into
two equal groups, training and test sets, for learning and testing purposes.

S&P data: We also employ latent graphical models for financial modeling and in particular,
for estimating the dependencies between the stock trends of different companies. The data
set consists of monthly stock returns of 84 companies listed in the S&P 100 index from 1990
to 2007. Experiments with this dataset allow us to demonstrate the performance of our
algorithm on data using a Gaussian graphical model.

Methods: We consider a regularized variant of the method proposed earlier for latent
graphical model selection. Here, in every iteration, the decision to add hidden variables to
a local neighborhood is based on the improvement of the overall BIC score. This allows us
to trade off model complexity and data fitting. In addition, we obtain better generalization
by avoiding overfitting. Note that our proposed method only deals with structure estimation
and we use expectation maximization (EM) for parameter estimation. For the newsgroup
data we compare the proposed method with the LDA model.

Implementation: The above method is implemented in MATLAB. We used the modules
for LBP made available with the UGM package. The LDA models are learnt using the lda
package.



The 16 companies added after 1990 are dropped from the list of 100 companies listed in the S&P 100 stock index for
this analysis.

Typically, LDA models the counts of different words in documents. Here, since we have binary data, we consider
a binary LDA model where the observed variables are binary.

These codes can be downloaded from http://www.di.ens.fr/~mschmidt/Software/UGM.html
http://chasen.org/~daiti-m/dist/lda/
Threshold selection r for our method: Recall that the parameter r in our method controls
the size of neighborhoods over which the local MSTs are constructed in the first step of
our method. We earlier presented ranges of r where recovery of the loopy structure is
theoretically guaranteed (w.h.p.). However, in practice, this range is unknown, since the
model parameters are unknown to the learner, and also since there is no ground truth with
respect to real datasets. Here, we present an intuitive criterion for selecting the threshold based
on the BIC score. First, it is important to search for the optimal threshold in the correct
range, and this range is given by



(33)    $r_{\max} := \max_{(i,j) \in V \times V} \hat{d}(i,j), \qquad r_{\min} := \max_{j \in V} \min_{i \in V,\, i \neq j} \hat{d}(i,j)$,



if we disallow disconnected components in the output graph. Note that if we choose
$r > r_{\max}$, then the output is a latent tree. In our experiments, we choose one value above
$r_{\max}$ to find a reference tree model and compare it with other outcomes. For the 20 newsgroup
dataset, we find that $r_{\min} = 2.3678$ and $r_{\max} = 12.2692$. Therefore, we choose
$r \in \{3, 5, 7, 9, 11, 13\}$ for our experiments on newsgroup data. For the monthly stock returns
data, $r_{\min} = 1.0337$ and $r_{\max} = 8.1172$, and we choose $r$ from 1.1 to 8.2.
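
The search range in (33) can be sketched as follows, assuming the empirical distances are available as a symmetric matrix. The helper `threshold_range` follows the definition of $r_{\min}$ and $r_{\max}$; the grid rule in `candidate_thresholds` (start at the ceiling of $r_{\min}$, step by 2, stop at the first value above $r_{\max}$) is our own assumption, since the text only lists the chosen values. Both names are hypothetical.

```python
import math


def threshold_range(dist):
    """Return (r_min, r_max) for a symmetric matrix of pairwise distances.

    r_max is the largest pairwise distance; r_min is the largest
    nearest-neighbor distance, so that no node is left isolated.
    """
    n = len(dist)
    r_max = max(dist[i][j] for i in range(n) for j in range(n) if i != j)
    r_min = max(min(dist[i][j] for i in range(n) if i != j) for j in range(n))
    return r_min, r_max


def candidate_thresholds(r_min, r_max, step=2.0):
    """Grid from ceil(r_min) in increments of `step`, ending with one value above r_max."""
    r = float(math.ceil(r_min))
    grid = []
    while True:
        grid.append(r)
        if r > r_max:
            break
        r += step
    return grid


dist = [[0.0, 1.0, 4.0],
        [1.0, 0.0, 2.0],
        [4.0, 2.0, 0.0]]
print(threshold_range(dist))                  # (2.0, 4.0)
print(candidate_thresholds(2.3678, 12.2692))  # [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
```

With the newsgroup values reported above, this assumed grid rule happens to reproduce the set {3, 5, 7, 9, 11, 13}; the last value exceeds $r_{\max}$ and therefore corresponds to the reference tree model.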



Performance Evaluation: We evaluate performance based on the test perplexity [42] given by

(34)    $\text{Perp-LL} := \exp\left[-\frac{1}{np} \sum_{k=1}^{n} \log P(x^{\mathrm{test}}(k))\right]$,



where n is the number of test samples and p is the number of observed variables (i.e., words). 
Thus the perplexity is monotonically decreasing in the test likelihood and a lower perplexity 
indicates a better generalization performance. Along the lines of (34), we also evaluate the 
predictive perplexity [8] 



(35)    $\text{Pred-Perp-LL} := \exp\left[-\frac{1}{np} \sum_{k=1}^{n} \log P(x^{\mathrm{test}}_{\mathrm{pred}}(k) \,|\, x^{\mathrm{test}}_{\mathrm{obs}}(k))\right]$,

where a subset of word occurrences $x^{\mathrm{test}}_{\mathrm{obs}}$ is observed in test data and the performance of
predicting the rest of words is evaluated. In our experiments, we randomly select half the
words in test samples.

We also consider regularized versions of perplexity that capture the tradeoff between model
complexity and likelihood, given by

(36)    $\text{Perp-BIC} := \exp\left[-\frac{1}{np} \mathrm{BIC}(x^{\mathrm{test}})\right]$,

where the BIC score [47] is defined as

(37)    $\mathrm{BIC}(x^{\mathrm{test}}) := \sum_{k=1}^{n} \log P(x^{\mathrm{test}}(k)) - 0.5\,(\mathrm{df}) \log n$,

where df is the degrees of freedom in the model. For a graphical model, we set $\mathrm{df} := m + |E|$,
where $m$ is the total number of variables (both observed and hidden) and $|E|$ is
the number of edges in the model. For the LDA model, we set $\mathrm{df} := p(m - p) - 1$,
where $p$ is the number of observed variables (i.e., words) and $m - p$ is the number of hidden
variables (i.e., topics). This is because an LDA model is parameterized by a $p \times (m - p)$ topic
probability matrix and an $(m - p)$-length Dirichlet prior. Thus, the BIC perplexity in (36)
is monotonically decreasing in the BIC score, and a lower BIC perplexity indicates a better
tradeoff between model complexity and data fitting. However, the likelihood and BIC score
in (34) and (36) are not tractable for exact evaluation in general graphical models since
they involve the partition function. We employ loopy belief propagation (LBP) to evaluate
them. Note that it is exact on a tree model and approximate for loopy models. Along the
lines of predictive perplexity in (35), we also consider its regularized version



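(38)    $\text{Pred-Perp-BIC} := \exp\left[-\frac{1}{np} \mathrm{BIC}(x^{\mathrm{test}}_{\mathrm{pred}} \,|\, x^{\mathrm{test}}_{\mathrm{obs}})\right]$,

where the conditional BIC score is given by

(39)    $\mathrm{BIC}(x^{\mathrm{test}}_{\mathrm{pred}} \,|\, x^{\mathrm{test}}_{\mathrm{obs}}) := \sum_{k=1}^{n} \log P(x^{\mathrm{test}}_{\mathrm{pred}}(k) \,|\, x^{\mathrm{test}}_{\mathrm{obs}}(k)) - 0.5\,(\mathrm{df}) \log n$.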
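
The perplexity and BIC definitions above translate directly into code once per-sample test log-likelihoods are available (in the experiments these are obtained with LBP, which is not reproduced here). The following minimal sketch takes such log-likelihoods as given; the function names are hypothetical, and the degrees of freedom for a graphical model are taken as the number of variables plus the number of edges, following our reading of the garbled definition in the text.

```python
import math


def perp_ll(log_liks, num_observed):
    """Test perplexity: exp(-(1 / (n p)) * sum_k log P(x_k))."""
    n = len(log_liks)
    return math.exp(-sum(log_liks) / (n * num_observed))


def bic_score(log_liks, df):
    """BIC score: sum_k log P(x_k) - 0.5 * df * log n."""
    n = len(log_liks)
    return sum(log_liks) - 0.5 * df * math.log(n)


def perp_bic(log_liks, num_observed, df):
    """BIC-regularized perplexity: exp(-(1 / (n p)) * BIC)."""
    n = len(log_liks)
    return math.exp(-bic_score(log_liks, df) / (n * num_observed))


def graphical_model_df(num_variables, num_edges):
    """Degrees of freedom assumed here for a graphical model: m + |E|."""
    return num_variables + num_edges


# Four test samples over p = 2 observed variables, each with likelihood 0.25.
log_liks = [math.log(0.25)] * 4
df = graphical_model_df(num_variables=3, num_edges=2)
print(perp_ll(log_liks, num_observed=2))       # 2.0 up to rounding
print(perp_bic(log_liks, num_observed=2, df=df))
```

The same functions apply to the predictive variants in (35) and (38) if the inputs are the conditional log-likelihoods of the predicted words given the observed ones. Since the penalty term is positive for n > 1, the BIC perplexity is never smaller than the likelihood-based perplexity, which matches the columns of Table 1.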



In addition, we also evaluate topic coherence, frequently considered in topic modeling. It
is based on the average pointwise mutual information (PMI) score

(40)    $\mathrm{PMI} := \frac{1}{45|H|} \sum_{h \in H} \sum_{i,j \in A(h),\, i<j} \mathrm{PMI}(X_i; X_j), \qquad \mathrm{PMI}(X_i; X_j) := \log \frac{P(X_i = 1, X_j = 1)}{P(X_i = 1)\, P(X_j = 1)}$,

where the set $A(h)$ represents the "top-10" words associated with topic $h \in H$. The number
of such word pairs for each topic is $\binom{10}{2} = 45$, and is used for normalization. In [43], it is
found that the PMI scores are a good measure of human evaluated topic coherence when 
it is computed using an external corpus. It is also observed that using a related external 
corpus gives a high PMI. Hence, in our experiments, we choose a corpus containing news 
articles from the NYT articles bag-of- words dataset. This dataset has a vocabulary of 102660 
words from 300,000 separate articles [27]. For LDA models, the top 10 words for each topic 
are selected based on the topic probability vector. For latent graphical models, we use the 
criterion of information distances on the learnt model to select the 10 nearest words for each 
topic. 
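
The topic coherence score in (40) can be sketched as follows, assuming the external corpus is represented as a list of word sets (one per document) and each topic as a list of its top words. The function names are hypothetical. Pairs that never co-occur are assigned a PMI of negative infinity here, which is one possible convention; the text does not say how such pairs are handled.

```python
import math
from itertools import combinations


def pmi(word_a, word_b, documents):
    """log P(a=1, b=1) / (P(a=1) P(b=1)) estimated from document frequencies."""
    n = len(documents)
    p_a = sum(word_a in doc for doc in documents) / n
    p_b = sum(word_b in doc for doc in documents) / n
    p_ab = sum(word_a in doc and word_b in doc for doc in documents) / n
    if p_ab == 0.0:
        return float("-inf")  # assumed convention for pairs that never co-occur
    return math.log(p_ab / (p_a * p_b))


def topic_coherence(top_words_per_topic, documents):
    """Average PMI over all unordered word pairs within each topic."""
    total = 0.0
    num_pairs = 0
    for words in top_words_per_topic:
        for word_a, word_b in combinations(words, 2):
            total += pmi(word_a, word_b, documents)
            num_pairs += 1
    return total / num_pairs


documents = [{"a", "b"}, {"a", "b"}, {"a"}, {"c"}]
print(topic_coherence([["a", "b"]], documents))  # log(4/3), roughly 0.288
```

With ten words per topic, the inner loop visits 45 pairs per topic, so the normalization equals the factor 45|H| in (40).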

5.2. Experimental Results. 

Results for Synthetic Data: With threshold levels closer to $r_{\max}$, defined in (33), the
proposed method discovers a number of latent nodes closer to the actual number
in the underlying model, whereas lower thresholds introduce more cycles and hidden variables.
This is intuitive and occurs with real datasets as well. The normalized BIC scores
(normalized with respect to $n$ and $p$) of the loopy graphs improve with the number of samples
$n$, as shown in Figure 2b. This is expected since the data becomes less noisy with



The likelihood is evaluated using $P(x_V) = \frac{P(x_{V \cup H})}{P(x_H | x_V)}$, where $P(x_H | x_V)$ and $P(x_{V \cup H})$ are computed using LBP,
which is exact for trees. The above expression holds for any configuration of hidden variables $x_H$; however, we use the
most likely hidden state to avoid numerical issues.
(a) Output under threshold r = 8. (b) Normalized BIC vs. threshold r for different sample sizes (n = 500 and n = 5000).

Fig 2. Results for synthetic data with girth g = 10 using the proposed method.



Method     r    Hidden  Edges  PMI     Perp-LL  Perp-BIC  Pred-Perp-LL  Pred-Perp-BIC
Proposed   3    55      265    0.2638  1.1533   1.1560    1.0695        1.0720
Proposed   5    39      293    0.4875  1.1567   1.1594    1.0424        1.0448
Proposed   7    32      183    0.4313  1.1498   1.1518    1.0664        1.0682
Proposed   9    24      129    0.6037  1.1543   1.1560    1.0780        1.0795
Proposed   11   26      125    0.4585  1.1555   1.1571    1.0787        1.0802
Proposed   13   24      123    0.4289  1.1560   1.1576    1.0788        1.0803
LDA        NA   10      NA     0.2921  1.1480   1.1544    1.1623        1.1656
LDA        NA   20      NA     0.1919  1.1348   1.1474    1.1572        1.1638
LDA        NA   30      NA     0.1653  1.1421   1.1612    1.1616        1.1715
LDA        NA   40      NA     0.1470  1.1494   1.1752    1.1634        1.1767

Table 1
Comparison of proposed method under different thresholds (r) with LDA under different number of topics (i.e.,
number of hidden variables) on 20 newsgroup data. For definition of perplexity and predictive perplexity based on
test likelihood and BIC scores, and PMI, see (34), (35), (36), (38) and (40).



more samples. Figure 2b shows an overall improvement in the normalized BIC score with 
increasing number of samples n for different thresholds r. Figure 2b shows the variation of 
normalized BIC scores for graphs learnt using thresholds r = 4 to 9 with girth g = 10. We 
observe that the normalized BIC score decreases for the lowest threshold (r = 4), where the 
output graph shows a significant increase in latent nodes and edges, resulting in overfitting, 
and higher thresholds have better BIC. However, once the threshold results in a tree model, 
the BIC degrades since the cycles are no longer present. 

Graph Structure for Newsgroup data: We employ our method to learn the graph structures
under different thresholds $r \in \{3, 5, 7, 9, 11, 13\}$ on newsgroup data, which controls the length
of cycles. At r = 13, as shown in Fig 5, we obtain a latent tree, and for $r \in \{3, 5, 7, 9\}$, we
obtain loopy models. The first long cycle appears at r = 9, shown in Fig 4. At r = 7, we find
a combination of short and long cycles. We find that models with cycles are more effective in
discovering intuitive relationships. For instance, in the latent tree (r = 13), the link between
"computer" and "software" is missing due to the tree constraint, but is discovered when
r < 9. Moreover, we see that common words across different topics tend to connect the local
subgraphs. For instance, the word "program" is used in the context of both space program
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windows 


bible 
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driver 
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windows 


graphics 


religion 
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pc 
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earth 


windows 


earth 


software 


version 


question 


disk 


shuttle 


scsi 


ftp 


fact 


science 


mars 


computer 


pc 


jews 


drive 


space 


system 


disk 


evidence 


university 



Table 2 

Top 10 topic words from selected topics in loopy graphical model with threshold r = 9. The topic number corresponds
to the labels of hidden variables in the loopy graph shown in Figure 4.
nasa       files     graphics  world     states
insurance  dos       video     fact      research
earth      format
moon       ftp       windows   jesus     university
orbit      program   computer  religion  mac
mission    software  pc        bible     scsi
launch     win       version   evidence  computer
gun        version   software  human     system
shuttle    pc        system    question  power



Table 3 
Top 10 topic words corresponding to selected topics from the LDA model with 10 topics. 



and computer programs. Similarly, the word "earth" is used both in the context of religion 
and space exploration. 

Perplexity and Topic Coherence for Newsgroup Data: In Table 1, we present results under
our method and under LDA modeling on newsgroup data. For the LDA model, we vary
the number of hidden variables (i.e., topics) as {10, 20, 30, 40}. In contrast, our method is
designed to optimize for the number of hidden variables, and does not need this input.
We note that our method is competitive in terms of both predictive perplexity and topic
coherence. We find that the topic coherence (i.e., PMI) for our method is optimal at r = 9,
where the graph has a single long cycle and a few short cycles. Intuitively, this model is able
to discover more relationships between words, which the latent tree (r = 13) is unable to
do. On the other hand, for r < 9, topic coherence is degraded, which suggests that adding too
many cycles is counterproductive. However, the model at r = 5 performs better in terms of
predictive perplexity, indicating that it is able to use evidence from more observed words for
prediction on test data. Moreover, all of our latent graphical models outperform the LDA
models in terms of predictive perplexity. The top 10 topic words for selected topics are given
in Table 2 for our method (at r = 9) and in Table 3 for the LDA model (with 10 topics).

Graph Structure for Stock Market Data: The outcome of applying the proposed algorithm to
stock market data is presented in Table 4. We observe that the number of edges and hidden
variables remain fairly constant over a large range of thresholds. Specifically for $r \in [5.9, 6.7] \cup [6.8, 7.7]$,
we obtain the same graph structure (for $r > r_{\max}$, we obtain a tree). Another
r     Hidden  Edges  Perp-LL  Perp-BIC
2.7   35      154    1.9498   2.0296
3.9   39      139    2.0200   2.0993
4.9   35      129    2.0210   2.0960
5     36      131    2.0169   2.0927
6.7   26      111    2.0344   2.1016
7.7   26      111    2.0353   2.1025
8.2   26      110    2.0405   2.1076

Table 4
Comparison of the proposed method under different thresholds (r) on stock data. For
definition of perplexity based on test likelihood and BIC scores, see (34) and (36).



general trend observed is the improvement of the BIC score as the threshold decreases up
to a certain point. The graphs learned using r = 5, 7.7 and 8.2 are shown in Fig. 6, Fig. 7,
and Fig. 8. Interesting connections between companies emerge. The latent tree structure
in Fig. 8 captures many key relationships. In particular, the S&P index node has a high
degree since it captures the overall trend of various companies. Companies in similar sectors
and divisions are grouped together. For instance, retail stores such as "Target", "Walmart",
"CVS" and "Home Depot" are grouped together. However, additional relationships emerge as
the threshold is decreased and cycles are added. We observe that the first cycle that is added
connects the various oil companies, which suggests strong interdependencies and influence
on the S&P 100 index. In addition, more cycles emerge when the threshold is decreased
further. For instance, in Figure 6, we find a cycle connecting the aviation company "Boeing"
with "Honeywell", which is in the aviation industry but is also in the chemical industry
and connects to companies such as "Dow Chemicals". Thus, as in the newsgroup data, we find
that companies in multiple categories lead to cycles in the underlying graph.

Edge Density vs. Threshold r: We now study the edge density (i.e., number of edges) in the
initialization step of our method as a function of the threshold r for both newsgroup and
stock data. Recall that the initialization step involves building a loopy graph on observed
variables (and no hidden variables). The edge density in this step is indicative of the number of
cycles added to the ultimate latent model. We observe that the graphs become denser as r is
reduced from $r_{\max}$. However, when r is very small, the number of edges decreases since the
nodes have sparser neighborhoods. This trend is seen in both datasets: Figures 3a and 3b show the
variation for newsgroup and stock data. For the newsgroup data, the graph density peaks
at r = 5, which also achieves the best (i.e., lowest) predictive perplexity (see Table 1). Thus, we see
a direct relationship between the edge density and the corresponding predictive performance
of the learnt model. Intuitively, this is because as the number of edges increases, prediction
at any node involves more evidence. However, as the threshold r is reduced further, graphs
become less dense, and there is also a corresponding degradation in the predictive perplexity.
The above experiments confirm the effectiveness of our approach for discovering hidden
topics, and are in line with the theoretical guarantees established earlier in the paper. Our
analysis reveals that a large class of loopy graphical models with latent variables can be
learnt efficiently in different domains.

Fig 3. Variation of edge density of graphs at the initialization stage of LocalCLGrouping vs. distance threshold r. (a) Newsgroup data. (b) Stock data.

6. Conclusion. In this paper, we considered latent graphical models Markov on girth-constrained
graphs and proposed a novel approach for structure estimation. We established
the correctness of the method when the model is in the regime of correlation decay and
also derived PAC learning guarantees. We compared these guarantees with other methods
for graphical model selection, where there are no latent variables. Our findings reveal that
latent variables do not add much complexity to the learning process in certain models and
regimes, even when the number of hidden variables is large. These findings expand the realm
of latent models that can be learned tractably.
Fig 4. Loopy Graph Learned using r = 9 with RegLocalCLGrouping on 20 newsgroup data.

Fig 5. Tree Graph Learned using r = 13 with RegLocalCLGrouping on 20 newsgroup data.

Fig 6. Loopy Graph Learned using r = 5 with LocalCLGrouping on S&P 100 monthly stock return data.

Fig 7. Loopy Graph Learned using r = 7.7 with LocalCLGrouping on S&P 100 monthly stock return data.

Fig 8. Tree Graph Learned using r = 8.2 with LocalCLGrouping on S&P 100 monthly stock return data.



APPENDIX A: BACKGROUND ON LATENT TREE MODELS 

We first recap the results on latent tree models, which will subsequently be extended to
more general latent graphical models. It is well known that tree-structured graphical models
Markov on a tree $T = (W, E)$ have a special form of factorization given by

(41)    $P(x_W) = \prod_{i \in W} P_{X_i}(x_i) \prod_{(i,j) \in T} \frac{P_{X_i, X_j}(x_i, x_j)}{P_{X_i}(x_i)\, P_{X_j}(x_j)}$.



Comparing with general graphical models, we note that tree distributions are directly
parameterized in terms of pairwise marginal distributions on the edges. Similarly, a Markov
model can be described on a rooted directed tree $\vec{T}$ with root $r \in W$, where the edges of
$\vec{T}$ are directed away from the root. Let $\mathrm{Pa}(i)$ denote the (unique) parent of node $i \neq r$ and
$P_{X_i | X_{\mathrm{Pa}(i)}}$ denote the corresponding conditional distribution. The Markov model is given by

(42)    P(x_W) = P_{X_r}(x_r) \prod_{i \in W,\, i \neq r} P_{X_i | X_{\mathrm{Pa}(i)}}(x_i \mid x_{\mathrm{Pa}(i)}).


A Markov model is said to be non-singular [40, 49] if (a) for all $e \in T$, the conditional
distributions satisfy $0 < |\det(P_{X_i | X_{\mathrm{Pa}(i)}})| < 1$ and (b) for all $i \in V$, $P_{X_i}(x) > 0$ for all $x \in \mathcal{X}$.
A non-singular Markov model on an undirected tree $T$ and its directed counterpart $\vec{T}$ are
equivalent [40, 49]. Note that non-singularity is equivalent to positivity (i.e., bounded
potential functions) for Markov tree models. In particular, Ising models on trees with bounded
node and edge potentials are non-singular. This is because under positivity, there is positive
probability for any global configuration of node states, which implies that the conditional
probability at a node given any of its neighbors cannot be degenerate.

Fig 9. Quartet Q(ab|uv). See (44).

Latent tree models or phylogenetic tree models are tree-structured graphical models in
which a subset of the nodes is hidden or latent. Our goal in this paper is to leverage the
techniques developed for learning latent tree models to analyze a more general class of latent
graphical models.

A.1. Learning Latent Tree Models. Learning the structure of latent tree models is an
extensively studied topic. A majority of structure learning methods (known as distance-based
methods) rely on the presence of an additive tree metric [23, 44]. The additive tree metric can
be obtained by considering the pairwise marginal distributions of a tree-structured graphical
model. For instance, the work in [39] considers the following metric for discrete distributions
satisfying the non-singular condition:

(43)    d(i,j) := -\log |\det(P_{X_i,X_j})|, \quad \forall\, i,j \in V.

By the non-singularity assumption, we have that $|\det(P_{X_i,X_j})| > 0$ for all $i, j \in W$. The distance
metric further simplifies for some special distributions; e.g., for symmetric Ising models, it is
given by the negative logarithm of the correlation between the node pair under consideration [48].
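
The additivity of the distance in the symmetric Ising case can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses a hypothetical three-node chain 1-2-3 with zero node potentials and edge potentials 0.8 and 0.5 (values chosen here, not taken from the paper), computes the correlations by exhaustive enumeration, and takes the distance to be the negative log of the absolute correlation, as stated above for symmetric Ising models.

```python
import itertools
import math


def chain_correlations(theta12, theta23):
    """Correlations E[X_a X_b] of a 3-node Ising chain 1-2-3 with zero
    node potentials, computed by enumerating all states in {-1,+1}^3."""
    z = 0.0
    corr = {(1, 2): 0.0, (2, 3): 0.0, (1, 3): 0.0}
    for x1, x2, x3 in itertools.product((-1, 1), repeat=3):
        weight = math.exp(theta12 * x1 * x2 + theta23 * x2 * x3)
        z += weight
        corr[(1, 2)] += weight * x1 * x2
        corr[(2, 3)] += weight * x2 * x3
        corr[(1, 3)] += weight * x1 * x3
    return {pair: value / z for pair, value in corr.items()}


def distance(correlation):
    """Distance for a symmetric Ising pair: -log |correlation|."""
    return -math.log(abs(correlation))


# Hypothetical edge potentials, chosen only for illustration.
corr = chain_correlations(0.8, 0.5)
d12 = distance(corr[(1, 2)])
d23 = distance(corr[(2, 3)])
d13 = distance(corr[(1, 3)])
print(d12 + d23, d13)  # the two values agree: the metric is additive along the path
```

On a tree, correlations multiply along a path in the symmetric model, so the distances add; this is the property that the quartet tests rely on.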

A.1.1. Quartet Based Methods. A popular class of learning methods is based on the
construction of quartets (e.g., [12, 25, 39]) and various procedures to merge the inferred quartets.
A quartet is a structure over four observed nodes, as shown in Fig. 9. We now recap the
classical quartet test operating on any additive tree metric [23, 44]. The path structure refers to
the configuration of paths between the given nodes.

Definition 3 (Quartet or Four-Point Condition on Trees). Given an additive metric
$[d(i,j)]_{i,j \in V}$ on a tree, the tuple of four nodes $a, b, u, v \in V$ has the path structure in Fig. 9 iff

(44)    d(a,b) + d(u,v) < \min\big(d(a,u) + d(b,v),\; d(b,u) + d(a,v)\big),

and the structure in Fig. 9 is denoted by $Q(ab|uv)$.
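
As a concrete illustration of the four-point condition (44), the following sketch evaluates the three possible pairings of four leaves and returns the one with the strictly smallest sum of within-pair distances. The function name `quartet_split`, the edge lengths, and the tree shape (a, b attached to one hidden node; u, v attached to another) are assumptions made for this example only.

```python
from itertools import combinations


def quartet_split(d, a, b, u, v):
    """Return the pairing whose within-pair distance sum is strictly smallest,
    i.e. the pairing satisfying (44), or None if there is no strict minimum."""
    pairings = [((a, b), (u, v)), ((a, u), (b, v)), ((a, v), (b, u))]
    sums = [d[frozenset(p)] + d[frozenset(q)] for p, q in pairings]
    best = min(range(3), key=lambda k: sums[k])
    others = [sums[k] for k in range(3) if k != best]
    return pairings[best] if sums[best] < min(others) else None


# Hypothetical quartet tree: leaves a, b hang off one hidden node, u, v off another.
leaf_edge = {"a": 0.3, "b": 0.7, "u": 0.4, "v": 0.2}
side = {"a": 0, "b": 0, "u": 1, "v": 1}
internal_edge = 0.5

# Additive tree metric: sum of edge lengths along the unique path.
d = {}
for x, y in combinations("abuv", 2):
    crossing = internal_edge if side[x] != side[y] else 0.0
    d[frozenset((x, y))] = leaf_edge[x] + leaf_edge[y] + crossing

split = quartet_split(d, "a", "b", "u", "v")
print(split)
```

The pairing that keeps each pair on its own side of the internal edge avoids counting that edge, which is why its sum is the smallest.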

It is well known that the set of all quartets uniquely characterizes a latent tree. In [25], it
was shown that a subset of quartets, termed representative quartets, suffices to uniquely
characterize a latent tree. The set of representative quartets consists of one quartet for each
edge in the latent tree with shortest (graph) distances between the observed nodes.

Algorithm 2 Quartet($\hat{d}^n(V)$, $\Lambda$) test using distance estimates $\hat{d}^n(V) := \{\hat{d}(i,j)\}_{i,j \in V}$ and
confidence interval $\Lambda$.

Input: Distance estimates between the observed nodes $\hat{d}^n(V) := \{\hat{d}(i,j)\}_{i,j \in V}$ and confidence interval $\Lambda$.
Initialize set of quartets $Q(V) \leftarrow \emptyset$.
for $\{i, j, i', j'\} \in V^4$ do
    if $(e^{-\hat{d}(i,j)} - \Lambda)_+ (e^{-\hat{d}(i',j')} - \Lambda)_+ > (e^{-\hat{d}(i,j')} + \Lambda)_+ (e^{-\hat{d}(i',j)} + \Lambda)_+$ then
        Declare Quartet: $Q(V) \leftarrow Q(ij|i'j')$.
    end if
    if No quartet declared for $\{i, j, i', j'\}$ then
        $\perp_{i,j,i',j'}$ (Declare null).
    end if
end for

Algorithm 3 RG($\hat{d}^n(V)$, $\Lambda$, $\tau$) test using distance estimates $\hat{d}^n(V) := \{\hat{d}(i,j)\}_{i,j \in V}$, confidence
interval $\Lambda$ and threshold $\tau$ for merging nodes.

Input: Distance estimates between the observed nodes $\hat{d}^n(V) := \{\hat{d}(i,j)\}_{i,j \in V}$, confidence interval $\Lambda$ and threshold
$\tau$. Let $C(a)$ denote the children of node $a$.
Initialize $A \leftarrow V$, $C(i) \leftarrow \{i\}$ for all $i \in V$ and $Q(V) \leftarrow$ Quartet($\hat{d}^n(V)$, $\Lambda$).
while $A \neq \emptyset$ do
    if $\exists\, i, j \in A$ s.t. for each $a \in C(i)$ and $b \in C(j)$, $c, d \notin C(i) \cup C(j)$, $\{ac|bd, ad|bc\} \notin Q(V)$, i.e., $a, b$ are on the same
    side of all such quartets in $Q(V)$, then
        Declare $i, j$ as siblings and introduce hidden node $h$ as parent and $C(h) \leftarrow C(i) \cup C(j)$.
        Remove $i, j$ from $A$ and add $h$ to $A$.
    else
        Sibling relationships cannot be further inferred. Break.
    end if
end while
Form forest $\hat{T}$ based on sibling and child/parent relationships.
Merge edges in $\hat{T}$ of length less than $\tau$ and output $\hat{T}$.

A.1.2. Recursive Grouping. We recap the recursive grouping RG($\hat{d}^n(V)$, $\Lambda$, $\tau$) method
proposed in [19] (and its refinement in [2]). The method is based on a robust quartet test
Quartet($\hat{d}^n$, $\Lambda$) given in Algorithm 2. If the confidence bound is not met, a null ($\perp$) result is
declared. In the first iteration of RG, the algorithm searches for node pairs which occur on the
same side of all the quartets output by the quartet test Quartet($\hat{d}^n$, $\Lambda$), declares them
as siblings and introduces hidden variables. In later iterations of RG, sibling relationships
between hidden variables are inferred through quartets involving their children. Finally, weak
edges are merged and a tree (and more generally a forest) is output: neighboring nodes (at
least one of which is hidden) are merged based on the threshold $\tau$. We later use a modified
version of the recursive grouping method as a routine in our algorithm for estimating locally
tree-like graphs. See Section 3 for details.
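
The robust quartet test can be sketched as follows. This is an illustrative reading rather than the paper's exact procedure: it assumes the test compares products of exponentiated distances shrunk or inflated by a margin `lam` (the confidence interval), and it assumes a split is declared only if it beats both alternative pairings. The function names and the numerical distances are hypothetical.

```python
import math
from itertools import combinations


def pos(x):
    """Positive part (x)_+ = max(x, 0)."""
    return max(x, 0.0)


def robust_quartet(d_hat, nodes, lam):
    """Declare a split of four nodes only if it wins against both alternative
    pairings by the margin lam; otherwise return None (the null result)."""
    a, b, u, v = nodes

    def sim(x, y):
        return math.exp(-d_hat[frozenset((x, y))])

    pairings = [((a, b), (u, v)), ((a, u), (b, v)), ((a, v), (b, u))]
    for k, (p, q) in enumerate(pairings):
        lhs = pos(sim(*p) - lam) * pos(sim(*q) - lam)
        rivals = [pairings[m] for m in range(3) if m != k]
        if all(lhs > pos(sim(*p2) + lam) * pos(sim(*q2) + lam) for p2, q2 in rivals):
            return (p, q)
    return None


# Hypothetical distances from a quartet tree with split ab|uv (illustrative values).
leaf_edge = {"a": 0.3, "b": 0.7, "u": 0.4, "v": 0.2}
side = {"a": 0, "b": 0, "u": 1, "v": 1}
d_hat = {
    frozenset((x, y)): leaf_edge[x] + leaf_edge[y] + (0.5 if side[x] != side[y] else 0.0)
    for x, y in combinations("abuv", 2)
}

confident = robust_quartet(d_hat, ("a", "b", "u", "v"), lam=0.01)
too_cautious = robust_quartet(d_hat, ("a", "b", "u", "v"), lam=0.2)
print(confident, too_cautious)
```

A small margin recovers the split, while a margin that is too large relative to the signal yields the null result, which is the trade-off that the choice of the confidence interval controls.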

A.1.3. Chow-Liu Grouping. An alternative method, known as Chow-Liu grouping (CLGrouping),
was proposed in [19]. Although the theoretical results for CLGrouping are similar to earlier
results (e.g., [25]), experiments on both synthetic and real data sets revealed significant
improvement over earlier methods in terms of likelihood fitting and number of hidden variables
added.

Denote $(\cdot)_+ := \max(\cdot, 0)$.

Algorithm 4 CLGrouping($\hat{d}^n(V)$, $\Lambda$, $\tau$) for graph estimation using distance estimates $\hat{d}^n(V) :=
\{\hat{d}(i,j)\}_{i,j \in V}$, confidence interval $\Lambda$ and threshold $\tau$.

Input: Distance estimates between the observed nodes $\hat{d}^n(V) := \{\hat{d}(i,j)\}_{i,j \in V}$, confidence interval $\Lambda$ and threshold
$\tau$. Let MST($V$; $\hat{d}^n$) denote the minimum spanning tree over $V$ according to the metric $\hat{d}^n(V)$. Given a tree $T$, let
Leaf($T$) denote the set of leaves. Let $\mathcal{N}[i; T]$ denote the closed neighborhood of node $i$ in tree $T$.
Initialize $\hat{T} \leftarrow$ MST($V$; $\hat{d}^n$).
for $u \in V \setminus$ Leaf($\hat{T}$) do
    $A \leftarrow \mathcal{N}[u; \hat{T}]$.
    $S \leftarrow$ RG($\hat{d}^n(A)$, $\Lambda$, $\tau$).
    $\hat{T}(A) \leftarrow S$ (Replace subtree over $A$ with $S$ in $\hat{T}$)
end for
Output $\hat{T}$.




The CLGrouping method is summarized in Algorithm 4. The CLGrouping method always 
maintains a candidate tree structure and progressively adds more hidden nodes in local 
neighborhoods. The initial tree structure is the minimum spanning tree (MST) over the 
observed nodes with respect to the tree metric. The method then considers neighborhood 
sets on the MST and constructs local subtrees (using quartet based method or any other 
tree reconstruction algorithm). This local reconstruction property of CLGrouping makes it 
especially attractive for reconstructing girth-constrained graphs. 

APPENDIX B: ANALYSIS OF ISING MODELS 

For Ising models, the regime of correlation decay can be explicitly established. Recall that
$\Delta_{\max}$ is the maximum degree of graph $G$ and the maximum absolute edge potential is $\theta_{\max}$.

Lemma 2 (Correlation Decay in Ising Models). The class of Ising models is in the regime
of correlation decay when the maximum edge potential $\theta_{\max}$ satisfies

(45)    \alpha := \Delta_{\max} \tanh(\theta_{\max}) < 1.

The rate function $\zeta(\cdot)$ for correlation decay in (5) is given by

(46)    \zeta(l) = 2\alpha^{l}, \quad \forall\, l \in \mathbb{N}.

Moreover, for assumption (A3) to hold, it is sufficient that

(47)    \frac{\alpha^{g/2}}{\theta_{\min}^{\eta(\eta+1)+2}} = o(1),

where $g$ is the girth, $\theta_{\min}$ is the minimum edge potential and $\eta := d_{\max}/d_{\min}$.
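
The condition (45) and the rate function (46) are simple to evaluate. The helper below is a minimal sketch with hypothetical names; it checks whether a given maximum degree and maximum edge potential place the model in the correlation decay regime and, if so, returns the rate at a given distance.

```python
import math


def correlation_decay_rate(max_degree, theta_max, length):
    """Return (alpha, rate) with alpha = max_degree * tanh(theta_max) and
    rate = 2 * alpha**length; raise ValueError when alpha >= 1, i.e. when
    the sufficient condition (45) fails."""
    alpha = max_degree * math.tanh(theta_max)
    if alpha >= 1.0:
        raise ValueError("sufficient condition alpha < 1 is not satisfied")
    return alpha, 2.0 * alpha ** length


# Illustrative values: degree 3 and edge potentials bounded by 0.2.
alpha, rate_5 = correlation_decay_rate(3, 0.2, 5)
_, rate_6 = correlation_decay_rate(3, 0.2, 6)
print(alpha, rate_5, rate_6)  # the rate shrinks geometrically with the distance
```

Since the rate decays geometrically in the distance, a larger girth makes the local tree approximation more accurate, which is what condition (47) quantifies.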

Proof: The above result on correlation decay is based on the concept of self-avoiding walk
(SAW) trees, which converts the conditional distributions of a general model to those on
a tree model. See [4] for details. Regarding simplification of (B3) for Ising models to yield
(A3), it is easy to see that $g/2 - r/d_{\min} - 1 = g/4 + \omega(1)$ from the constraint on $r$ in (14). In
Section B, it is shown that $\nu$ is dominated by the second term, and its dependence on $\theta_{\min}$
is made explicit. □

We now establish that the general distance bounds based on local tree approximation in
(18) can be expressed in terms of potentials for the Ising model. Given an Ising model $P$ with
edge potentials $\theta = \{\theta_{i,j}\}$ and node potentials $\phi = \{\phi_i\}$, consider its attractive counterpart
$\bar{P}$ with edge potentials $\bar{\theta} := \{|\theta_{i,j}|\}$ and node potentials $\bar{\phi} := \{|\phi_i|\}$. Define the following
quantity:

(48)    \phi'_{\max} := \max_{i} \operatorname{atanh}\big(\bar{\mathbb{E}}(X_i)\big),

where $\bar{\mathbb{E}}$ is the expectation with respect to the distribution $\bar{P}$. Finally let $P_{X_{1,2};\{\theta,\phi_1,\phi_2\}}$ denote
the Ising model on two nodes $\{1, 2\}$ with edge potential $\theta$ and node potentials $\phi_1$ and $\phi_2$.

Lemma 3 (Distance Bounds and Bounds on Edge Potentials). For an Ising model, the
distance bounds in (18) are related to the bounds on edge potentials in (8) as follows:

(49)    d_{\min} \ge -\log |\det P_{X_{1,2};\{\theta_{\max}, \phi'_{\max}, \phi'_{\max}\}}|,

(50)    d_{\max} \le -\log |\det P_{X_{1,2};\{\theta_{\min}, 0, 0\}}|.

The proof is given below. We first consider the marginal distribution on a tree model.

Lemma 4 (Marginal Distribution on a Tree). For an Ising model $\{\theta, \phi\}$ on a tree $T$ with
edge potentials $\theta = \{\theta_{i,j}\}$ and node potentials $\phi = \{\phi_i\}$, the marginal distribution between
any two neighbors $(i,j) \in T$ is given by an Ising model $\{\theta_{i,j}, \phi'_i, \phi'_j\}$ with edge potential $\theta_{i,j}$
and node potentials $\phi'_i$ and $\phi'_j$ given by

(51)    \phi'_i = \operatorname{atanh}(\mathbb{E}[X_i; T_{-j}]), \qquad \phi'_j = \operatorname{atanh}(\mathbb{E}[X_j; T_{-i}]),

where $T_{-j}$ denotes the tree with node $j$ removed.

Proof: A more general version of this result is proven in [22, Lemma 4.1]. □

We also use the following result for attractive models ($\theta_{i,j} > 0$, $\forall\, (i,j) \in G$).

Lemma 5 (Griffiths' Second Inequality [30]). For two attractive Ising models Markov on the
same graph $G = (W, E)$ with potentials $0 \le \theta_{i,j} \le \theta'_{i,j}$ for all $(i,j) \in G$, we have

(52)    \mathbb{E}\Big[\prod_{i \in U} X_i;\, \theta\Big] \le \mathbb{E}\Big[\prod_{i \in U} X_i;\, \theta'\Big], \quad \forall\, U \subset W.

In particular, this means that if the potentials of a model are increased, then the marginal
expectations $\mathbb{E}[X_i]$ are also increased. This implies that if some of the edge potentials are set to
zero (meaning we take the model on a subgraph), $\mathbb{E}[X_i]$ is reduced.
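
The monotonicity stated in Lemma 5 can be checked by brute force on a small example. The sketch below is an illustration under assumptions of our own: a hypothetical zero-field attractive Ising model on a triangle, with the moment taken over the node pair U = {0, 2}. It compares the model with a stronger edge and with one edge removed.

```python
import itertools
import math


def ising_moment(n, edge_pot, subset):
    """E[prod_{i in subset} X_i] for a zero-field Ising model on n spins,
    computed by enumerating all 2**n configurations."""
    z = 0.0
    moment = 0.0
    for x in itertools.product((-1, 1), repeat=n):
        weight = math.exp(sum(t * x[i] * x[j] for (i, j), t in edge_pot.items()))
        z += weight
        moment += weight * math.prod(x[i] for i in subset)
    return moment / z


# Hypothetical attractive potentials on a triangle.
base = {(0, 1): 0.3, (1, 2): 0.3, (0, 2): 0.3}
stronger = {**base, (0, 1): 0.6}   # one edge potential increased
subgraph = {**base, (0, 2): 0.0}   # one edge removed (potential set to zero)

m_sub = ising_moment(3, subgraph, (0, 2))
m_base = ising_moment(3, base, (0, 2))
m_strong = ising_moment(3, stronger, (0, 2))
print(m_sub, m_base, m_strong)  # non-decreasing in the edge potentials
```

Removing an edge lowers the moment and strengthening an edge raises it, which is the comparison used in the proof of Lemma 3 to bound node potentials by those of the attractive counterpart.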

Finally we note a simple expression for the information distance in a symmetric Ising model
(with zero node potentials) on two nodes.

Fact 1 (Symmetric Ising Model). For a symmetric Ising model on two nodes $\{1, 2\}$ with
edge potential $\theta$ and zero node potentials, we have

(53)    d(1,2) := -\log |\det P_{X_{1,2};\{\theta,0,0\}}| = -\log |C_{1,2}| = -\log \tanh |\theta|,

where $C_{1,2} := \mathbb{E}[X_1 X_2]$ is the correlation between the two nodes.

Proof: For a symmetric model, we have $P(X_i = x) = 0.5$ for $i = 1, 2$ and $x \in \{-1, +1\}$.
Similarly $P(X_1 = + \mid X_2 = -) = P(X_1 = - \mid X_2 = +)$. Using these facts, the distance
$d(1,2) = -\log |C_{1,2}|$. The correlation simplifies to

(54)    C_{1,2} := \tfrac{1}{2}\big(\mathbb{E}[X_1 \mid X_2 = +] - \mathbb{E}[X_1 \mid X_2 = -]\big) = \tanh \theta.

□
From the above fact, assuming Lemma 3 holds, $d_{\max}$ for Ising models is given by

(55)    d_{\max} \le -\log \tanh \theta_{\min}.

Proof of Lemma 3: For an Ising model $P_{X_{1,2};\{\theta,\phi_1,\phi_2\}}$ on two nodes $\{1, 2\}$ with edge potential
$\theta$ and node potentials $\phi_1, \phi_2$, we have

(56) exp[-d(l, 2; {6, 0^, 02})] = | det P\ = ,,^ ^T'l^l^'l^ ,,^ ^,,2 - 

2 (e^ cosh(0i + 02) + e-^ cosh(0i - 02)) 

Without loss of generality, consider an attractive model ($\theta > 0$). The above function is
minimized with respect to $\{\phi_1, \phi_2\}$ when $\phi_1 = \phi_2 = 0$ since $\cosh(x) \ge \cosh(0) = 1$. Similarly
it is maximized with respect to $\{\phi_1, \phi_2\}$ when $\phi_1 = \phi_2 = \phi'_{\max}$ (for $\theta > 0$). We subsequently
establish that it is the maximum allowed node potential. When $\phi_1 = \phi_2 = 0$, we can show
that $\exp[-d(1,2)]$ is increasing in $\theta$ and thus, the minimum is attained when $\theta = \theta_{\min}$, and
the maximum when $\theta = \theta_{\max}$.

From Lemma 4, the marginal distribution between two neighbors on a tree model is
characterized. Only the node potentials at the two nodes are altered when the marginal distribution
at the two nodes is considered. The (absolute) node potential at the two nodes is dominated
by that of the attractive counterpart and cannot exceed $\phi'_{\max}$ in (51), from Griffiths' property of
attractive models in Lemma 5. □

Proof of Theorem 1: From Theorem 2, we have structural consistency when $n = \Omega(\nu^{-2} \log p)$,
where $\nu$ is given in (53). We have $\nu = \min\big(0.5\, e^{-r}(e^{d_{\min}} - 1),\ \exp[-0.5\, d_{\max}(r/d_{\min} + 2)]\big)$.
When $r$ is chosen as $r = \delta(\eta + 1) d_{\max} + \epsilon$, for some $\epsilon > 0$, we have that $\nu =
\exp[-0.5\, d_{\max}(r/d_{\min} + 2)]$ when $e^{-d_{\max}} < 1/3$. Using the fact that $e^{-d_{\max}} = \tanh \theta_{\min}$ from
(55), we have that $\tanh \theta_{\min} < 1/3$ holds when the maximum degree $\Delta_{\max} \ge 3$, since the
model is in the regime of correlation decay (B3), which by (45) gives $\tanh \theta_{\min} \le \tanh \theta_{\max} < 1/\Delta_{\max}$.
Since we require a minimum degree of three for identifiability of hidden nodes (B1), this is
satisfied, and we have the result.

For the special case when all the nodes are observed ($\delta = 1$), the sample complexity can
be improved by selecting the parameter $r > d_{\max} + \epsilon$ for some $\epsilon > 0$, building only
local MSTs, and considering their union. In this case the sample complexity is given by
$n = \Omega(e^{2r} \log p)$, which reduces to $n = \Omega(\theta_{\min}^{-2} \log p)$. □

APPENDIX C: STRUCTURAL CONSISTENCY OF LOCALCLGROUPING 

We first establish that the LocalCLGrouping algorithm proposed in Section 3 recovers the
unknown latent graph correctly when statistics corresponding to the tree limit are input. In
Section C.2, we then establish that distances based on exact statistics converge locally to
their tree limit. Finally, we consider sample-based analysis in Section C.3, and use standard
concentration results, along the lines of [26, Section 6], thereby proving Theorem 2.

C.1. Correctness of LocalCLGrouping under Local Tree Metric $d_{\mathrm{tree}}(V)$. Recall that
$d_{\mathrm{tree}}(V) := \{d(i,j;\mathrm{tree}) : i, j \in V\}$ is given by

        d(i,j;\mathrm{tree}) := -\log |\det P_{X_{i,j} | \mathrm{tree}(i,j)}|,

where $P_{X_{i,j} | \mathrm{tree}(i,j)}$ denotes the distribution at nodes $i$ and $j$ obtained by limiting the model to the
induced subgraph $\mathrm{tree}(i,j)$. Here $\mathrm{tree}(i,j) := G(B_l(i) \cup B_l(j))$ for $l = \lfloor g/2 \rfloor - 1$, and $g$ is the girth
of the graph. Since $\mathrm{tree}(i,j)$ has no cycles, it immediately follows that $d_{\mathrm{tree}}(V)$ is a tree
metric.

Fact 2 (Local Tree Metric). The distances $d_{\mathrm{tree}}(V)$ form an additive tree metric.

We now establish that the LocalCLGrouping algorithm proposed in Algorithm 1 outputs the
correct graph under the assumptions of Theorem 2 when a local tree metric $d_{\mathrm{tree}}(V)$, computed using
acyclic neighborhood subgraphs according to (16), is input to the algorithm. Note
that in practice, we only have access to empirical estimates $\hat{d}^n(V)$ of the distances $d(V)$, and
not $d_{\mathrm{tree}}(V)$. In Section C.2, we establish the local convergence of $d(V)$ to $d_{\mathrm{tree}}(V)$ under
correlation decay.

C.1.1. Recap of CLGrouping for Learning Latent Trees. We first recap the result from [19,
Lemma 8] that relates a latent tree model with the minimum spanning tree over the observed
nodes according to a tree metric. Note that in this case, $d(V)$ coincides with $d_{\mathrm{tree}}(V)$. For
every node $i \in W$ in the latent tree $T$, define a mapping $\mathrm{Sg} : W \mapsto V$, termed the surrogate
mapping, as follows:

(57)    \mathrm{Sg}(i; d) := \arg\min_{j \in V} d(i,j;T), \quad \forall\, i \in W.

Thus, observed nodes $V$ are their own surrogates while the hidden nodes $H$ are mapped to
the closest observed node according to metric $d(V)$. See Fig. 10 for an example.

Fig 10. A latent tree model over T = (W, E) and the corresponding minimum spanning tree MST(V; d) over the
observed nodes V ⊂ W. (a) Latent Tree Model. (b) Minimum Spanning Tree. The observed nodes are shaded while
the hidden nodes are unshaded. The thick lines in Fig. 10a represent the edge between a hidden node and its
surrogate. See Lemma 1.


Proposition 1 (Relating Latent Tree and MST). Given a latent tree $T = (W, E)$, set
of observed nodes $V \subset W$ and a tree metric $d(V)$, the minimum spanning tree MST($V$; $d$)
over the observed nodes satisfies the following properties:

1. The MST($V$; $d$) is obtained from the latent tree $T$ by merging each hidden node $h \in H$
with its surrogate $\mathrm{Sg}(h; d)$ and vice versa.

2. Let $\xi$ denote the maximum graph distance between a hidden node and its surrogate in
the latent tree $T$ and let $\delta$ denote the depth of tree $T$. We have

(58)    \xi \le \delta\, \frac{d_{\max;T}}{d_{\min;T}},

where $d_{\min;T}$ and $d_{\max;T}$ are bounds on the distance in $T$.
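
The first property of Proposition 1 can be seen on a toy example. The sketch below is an illustration only: it assumes a hypothetical latent star with one hidden node joined to three observed nodes, builds the induced additive distances among the observed nodes, and computes the MST with a plain implementation of Prim's algorithm (the helper name `prim_mst` and all edge lengths are made up for this example).

```python
def prim_mst(nodes, dist):
    """Edges of the minimum spanning tree over `nodes` (Prim's algorithm),
    where dist maps frozenset({a, b}) to the distance between a and b."""
    nodes = list(nodes)
    in_tree = {nodes[0]}
    edges = set()
    while len(in_tree) < len(nodes):
        u, v = min(
            ((a, b) for a in in_tree for b in nodes if b not in in_tree),
            key=lambda e: dist[frozenset(e)],
        )
        in_tree.add(v)
        edges.add(frozenset((u, v)))
    return edges


# Hypothetical latent star: hidden node h joined to observed nodes 1, 2, 3.
to_hidden = {1: 0.5, 2: 1.0, 3: 2.0}

# Additive tree metric among observed nodes: every path passes through h.
dist = {
    frozenset((a, b)): to_hidden[a] + to_hidden[b]
    for a in to_hidden
    for b in to_hidden
    if a < b
}

surrogate = min(to_hidden, key=to_hidden.get)  # observed node closest to h
mst = prim_mst(to_hidden, dist)
print(surrogate, sorted(tuple(sorted(e)) for e in mst))
```

The MST is a star centered at the surrogate: the hidden node has been contracted into its closest observed node, and CLGrouping later reverses this contraction by running a local reconstruction around such internal MST nodes.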

C.1.2. Union of Local MSTs under LocalCLGrouping. Using the results of CLGrouping, we
establish properties of the union of local minimum spanning trees for girth-constrained graphs
under correlation decay. To this end, consider the choice of parameter $r$ in (24) and bounds
$d_{\min;\mathrm{tree}}$ and $d_{\max;\mathrm{tree}}$. Also define

(59)    r' := \Big\lfloor \frac{r}{d_{\max;\mathrm{tree}}} \Big\rfloor, \qquad r'' := \Big\lceil \frac{r}{d_{\min;\mathrm{tree}}} \Big\rceil.

Recall that $B_r(i; d_{\mathrm{tree}})$ denotes the set of observed nodes within distance $r$ according to the
metric $d_{\mathrm{tree}}(V)$. Let $B_{r'}(i; G)$ denote the set of nodes (including hidden nodes) within graph
distance $r'$ from node $i \in V$ on graph $G$. By definition, $B_{r'}(i; G) \subset B_r(i; d_{\mathrm{tree}}) \subset B_{r''}(i; G)$.
In other words, every observed node within graph distance $r'$ of $i$ belongs to $B_r(i; d_{\mathrm{tree}})$, and every
node in $B_r(i; d_{\mathrm{tree}})$ has graph distance at most $r''$ from $i$. We have the following result.

Lemma 6 (Properties of Union of Local MSTs under $d_{\mathrm{tree}}(V)$). The graph formed by the
union of local minimum spanning trees ($G' := \cup_{i \in V}$ MST($B_r(i)$; $d_{\mathrm{tree}}$)) under the LocalCLGrouping
method using the distance metric $d_{\mathrm{tree}}(V)$, when the parameter $r$ is chosen according to (24),
satisfies the following properties:

1. $G'$ does not contain triangles.

2. $G'$ is formed by contracting each hidden node $h \in H$ to its surrogate node $\mathrm{Sg}(h; d_{\mathrm{tree}})$
(according to the distance metric (6)).

Proof: The first result is easy to see. We have that for each edge $(i,j) \in G'$, $d(i,j;\mathrm{tree}) < r$
since the MSTs are formed on nodes within distance $r$. By contradiction, assume that a
triangle exists between nodes $i, j, k \in V$ in $G'$. This implies that $d(i,j;\mathrm{tree}), d(j,k;\mathrm{tree}), d(k,i;\mathrm{tree}) < r$.
For a triangle to exist, we require another node $l \in V$ such that $d(j,l;\mathrm{tree}), d(j,k;\mathrm{tree}), d(k,l;\mathrm{tree}) < r$.
See Fig. 11. Since the maximum graph distance between any two nodes $i, j$ satisfying
$d(i,j;\mathrm{tree}) < r$ is $r''$, we have that the maximum length of the cycle containing $i, j, k, l$ is
$4r''$. When $4r'' < g$ (which holds for $r$ according to (24)), such a cycle cannot exist and such
triangles cannot occur in $G'$.

For the second result, from Fact 2, the distances $d_{\mathrm{tree}}(B_{r''}(i; G))$ form a tree metric when
$2r'' < g$, where $g$ is the girth of the graph $G$, which holds for the choice of $r$ in (24). This
implies that Proposition 1 is applicable and the minimum spanning tree MST($B_r(i)$; $d$) is
formed as a result of contraction of hidden nodes to their surrogates. When the parameter $\xi$
in (58) satisfies $\xi + \delta < r'$ (which is true under (23)), then every hidden node has a surrogate
within some local neighborhood $B_r(v)$ and forms a quartet with its surrogate node. This
implies that every hidden node $h \in H$ contracts to its surrogate node in some local MST.
□

Fig 11. Condition for existence of triangles in $G' := \cup_{i \in V}$ MST($B_r(i)$; $d_{\mathrm{tree}}$).
Proof of Theorem 2 under $d_{\mathrm{tree}}(V)$: We now show that the method LocalCLGrouping correctly
recovers the graph $G$ when tree-based distances $d_{\mathrm{tree}}(V)$ are input under the assumptions
of Theorem 2. From Lemma 6, we have that in the graph formed from the union of local
MSTs ($G' := \cup_{i \in V}$ MST($B_r(i)$; $d_{\mathrm{tree}}$)), each hidden node is contracted to its surrogate node.
The method LocalCLGrouping proceeds by reversing these contractions by considering
neighborhoods on $G'$ and constructing a local latent tree. Since there are no triangles in $G'$, the
constructions of the local latent trees are independent. From the correctness of CLGrouping
developed in [19], the local latent trees are correct since the distance metric converges locally
to a tree metric. Thus, the correctness of LocalCLGrouping under $d_{\mathrm{tree}}(V)$ is proven. □

Proof of Theorem 2 with samples: Combining Lemma 9, Lemma 10, and Lemma 12. □

C.2. Local Convergence to a Tree Metric. We have so far analyzed the performance
of the LocalCLGrouping algorithm under tree-based distances $d_{\mathrm{tree}}(V)$. We now relate the
distances $d(V)$ computed using exact pairwise statistics with $d_{\mathrm{tree}}(V)$ under correlation decay
according to (5). Let

        d_{\max}(l) := l\, d_{\max;\mathrm{tree}} - \log\big(1 - e^{l\, d_{\max;\mathrm{tree}}}\, |\mathcal{X}|\, \zeta(g/2 - l - 1)\big),

where $d_{\max;\mathrm{tree}}$ is the maximum $d(i,j;\mathrm{tree})$ for any two neighbors $i, j$ on graph $G$ according
to (18).

Proposition 2 (Local Convergence to a Tree Metric). When a discrete graphical model
satisfies correlation decay with rate $\zeta(\cdot)$ according to (5), we have a.a.s., for nodes $i, j \in W$
with graph distance $l$ in $G$ and $l < g/2 - 1$,

(60)    \big|\exp[-d(i,j;G)] - \exp[-d(i,j;\mathrm{tree})]\big| \le |\mathcal{X}|\, \zeta(g/2 - l - 1),

where $g$ is the girth of the graph, and $|\mathcal{X}|$ is the cardinality of the random variable at each
node. Additionally, we have

(61)    |d(i,j;G) - d(i,j;\mathrm{tree})| \le |\mathcal{X}|\, e^{d_{\max}(l)}\, \zeta(g/2 - l - 1).

Proof: From the definition of correlation decay in (5), we have that

        \| P_{X_{i,j}|G} - P_{X_{i,j}|\mathrm{tree}(i,j)} \|_1 \le \zeta(g/2 - l - 1),

since $\mathrm{tree}(i,j;G) := G(B_{\lfloor g/2 \rfloor - 1}(i) \cup B_{\lfloor g/2 \rfloor - 1}(j))$ and $g/2 - l - 1$ is the distance from $i$ and
$j$ to the nearest boundary.

From [7, Sec. 20], we have that for any $k \times k$ matrices $A$ and $E$,

(62)    |\det(A + E) - \det(A)| \le k \max\{\|A\|_p, \|A + E\|_p\}^{k-1} \|E\|_p.

Thus, we have that

        |\det(P_{X_{i,j}|G}) - \det(P_{X_{i,j}|\mathrm{tree}(i,j)})| \le |\mathcal{X}|\, \| P_{X_{i,j}|G} - P_{X_{i,j}|\mathrm{tree}(i,j)} \|_1 \le |\mathcal{X}|\, \zeta(g/2 - l - 1).

From Lipschitz continuity, we have that

        |d(i,j;G) - d(i,j;\mathrm{tree})| \le e^{d_{\max}(l)}\, |\det(P_{X_{i,j}|G}) - \det(P_{X_{i,j}|\mathrm{tree}(i,j)})|.

Let $d_{\max;G}(l)$ be the maximum $d(i,j;G)$ for any two nodes $i, j$ at graph distance $l$, and
similarly for $d_{\max;\mathrm{tree}}(l)$. Since no cycles are encountered in $\mathrm{tree}(i,j)$, $d(i,j;\mathrm{tree})$ is a tree
metric and thus $d_{\max;\mathrm{tree}}(l) = l\, d_{\max;\mathrm{tree}}(1)$. For $d_{\max;G}(l)$, we note from (60) that
$\exp[-d_{\max;G}(l)] \ge \exp[-l\, d_{\max;\mathrm{tree}}] - |\mathcal{X}|\, \zeta(g/2 - l - 1)$, so that $d_{\max;G}(l) \le d_{\max}(l)$.
□
Remark: When

(63)    e^{l\, d_{\max;\mathrm{tree}}}\, |\mathcal{X}|\, \zeta(g/2 - l - 1) = o(1),

then

(64)    |d(i,j;G) - d(i,j;\mathrm{tree})| \le |\mathcal{X}|\, e^{l\, d_{\max;\mathrm{tree}} + o(1)}\, \zeta(g/2 - l - 1) = o(1).

C.3. Sample-Based Analysis.

C.3.1. Concentration of Distance Estimates. We first derive the concentration bounds for
distance estimates along the lines of [26, 39]. Let $\hat{d}^n(V)$ be the estimated distances using
$n$ samples according to (6). We first recap the following result on the empirical distribution [50,
Thm. 2.1].

Proposition 3 (Guarantees for General Empirical Distribution). The following is true
for the empirical distribution $\hat{P}^n$, obtained using $n$ i.i.d. samples from a discrete distribution
$P$:

(65)    \mathbb{P}\big[\|\hat{P}^n - P\|_1 > \epsilon\big] \le 2^{k} \exp[-n\epsilon^2/2],

where $k$ is the dimension.
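
To get a feel for the bound (65), it can be inverted to give a number of samples that suffices for a target accuracy and failure probability. The helper names below are hypothetical and the numbers are purely illustrative; the formula is just (65) solved for $n$.

```python
import math


def l1_deviation_bound(n, eps, k):
    """Right-hand side of (65), capped at 1: 2**k * exp(-n * eps**2 / 2)."""
    return min(1.0, 2.0 ** k * math.exp(-n * eps ** 2 / 2.0))


def samples_needed(eps, delta, k):
    """Smallest integer n for which the bound (65) is at most delta."""
    return math.ceil(2.0 * (k * math.log(2.0) + math.log(1.0 / delta)) / eps ** 2)


# Illustrative setting: a pairwise distribution of two binary variables (k = 4).
n = samples_needed(eps=0.1, delta=0.05, k=4)
print(n, l1_deviation_bound(n, 0.1, 4))
```

The required sample size grows linearly in the dimension $k$ and quadratically in $1/\epsilon$, which is the source of the $\epsilon^2$ terms in the exponents of the lemmas that follow.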

Given a graph $G$, let the graph distance between two nodes $i$ and $j$ under consideration
on graph $G$ be $l$. Recall that $|\mathcal{X}|$ is the dimension of the variable at each node.

Lemma 7 (Concentration of Empirical Distances). For the empirical distance between nodes
$i$ and $j$ at graph distance $l$, computed according to (6) using $n$ samples, we have the following
result:



(66) 



P 



exp[-d(i, j; G)] - exp[-d(i, j; G)\\ > e 



< 21-^1 exp 



ne 



2 1 



2|A'| 



When e > \X\'^(m{g/2 — I — 1) and I < g/2 — 1, we additionally have that 
(67) 



P 



exp[—d{i,j;G)] — exp [—(i(i,j; tree)] | > e 



< 2l^l exp 



n 



2\X\ 



\y\'^(- 



l-V 



Proof: Along the lines of Proposition 2, using [7, Sec. 20], we have, 



(6J 



P 



\detiPlja)-detiP^,jG)\>e 



<2W 



exp 



ne 



2\X\ 



and thus (66) holds since $d(i,j;G) := -\log |\det(P_{X_{i,j}|G})|$. Using Proposition 2, we also have
(67). □

C.3.2. Recap of Sample Analysis of CLGrouping for Learning Latent Trees. We also recap the
result of [19] that the minimum spanning tree (MST) constructed over observed nodes under the
CLGrouping method is consistent when the underlying model is a latent tree.

Recall that $p := |V|$ is the number of observed nodes and $n$ is the number of samples.
Let $\eta$ be the maximum graph distance (with respect to the latent tree $T$) between any two
neighbors in MST($V$; $d$), and let $d_{\min}$, $d_{\max}$ be distance bounds on the edges of the latent tree
$T$.

Lemma 8 (Consistency of MST using CLGrouping for Latent Trees). Given a latent tree
$T = (W, E)$ and observed node set $V \subset W$, the MST constructed by the CLGrouping method using
empirical distances $\hat{d}^n(V)$ does not coincide with the true MST based on exact distances $d(V)$
with probability



P 



MST(V;d") ^ MST(l^;d)l < 2l'^l+y exp 



n 



8 kf 



-2r)dn 



'l-e 



Proof: From the property of the MST, 



P 



MST(l^;d") 7^MST(y;d) 



(a) 



P 



u 



g -d{u,v) ^ ^-d{i,j) 



(b) . 
< p' 



(c) 



(i,j)eMST(V;d) 
. ii,j)ePa.th{u,v) 



max F \euv — ^i i > e 

(Jj)GMST(y;d) '- ' 



-dlij) _ ^-d{u,v)l 



{i,j)ePa.th{u,v) 



< p^ max P [eu,v - Cij > e '''^■"="'(1 - e '^'"'")] 

i,j,U,v€:V 
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(d) 

< 2p^ max P 

u,vGV 



> 



'1-e 



'^min 



where $e_{u,v} := \exp[-\hat{d}(u,v)] - \exp[-d(u,v)]$ and similarly for $e_{i,j}$. Equality (a) is due to the
property of the MST, inequality (b) is the union bound, and inequality (c) is obtained by applying
bounds on $d(i,j)$ and $d(u,v)$:



-d{i,j) 



^~d(u,v) ^ g- 



-'^(«.i)n _ ^d,{i,j)-d{u,v)\ ^ Q-nd^^ 



'^min 



since $d(i,j) \le \eta\, d_{\max}$ for all $(i,j) \in$ MST($V$; $d$) and $d(i,j) - d(u,v) \le -d_{\min}$ for all $(i,j) \in$
MST($V$; $d$) and $u, v \in V$ containing $(i,j)$ on the path connecting them. Inequality (d) is
obtained from the fact that $e_{u,v} - e_{i,j} \le 2\max(|e_{u,v}|, |e_{i,j}|)$ and applying the union bound. The
final result is from (68) in Lemma 7. □

C.3.3. Sample Analysis of Union of MSTs under LocalCLGrouping. We now establish
consistency under the LocalCLGrouping algorithm using the above result and the local convergence of the
metric $d(V)$ to the tree-based metric $d_{\mathrm{tree}}(V)$, according to Proposition 2. Recall that $\hat{d}^n(V)$
denotes the estimates of the true distances $d(V)$ according to graph $G$. Let $d_{\mathrm{tree}}(V)$ denote
the distances obtained by considering only acyclic neighborhood subgraphs, defined in Proposition 2.
Given empirical distances $\hat{d}^n(V)$, tree distances $d_{\mathrm{tree}}$ and parameter $r$ according to (24),
for each $i \in V$, let $\hat{A}_i := B_r(i; \hat{d}^n)$. Define $\mathcal{L} := \mathbb{N} \cap (r/d_{\min},\, g/2)$.

Lemma 9 (Union of Local MSTs under LocalCLGrouping: I). Given a graphical model
Markov on graph $G = (W, E)$ satisfying the conditions of Theorem 2 with observed node set
$V \subset W$, we have



P 



U MST(1,; d") ^ U MST(1,; d^^ee) 



ieV 



(69) 



< 2' 'p^min I 2)0 exp 



+ exp 



n 



2\XV 
n 



iev 



0.5e-7e'^-- 



\a:\^CU^-i-i) 



2\X\ 



{ldmin-r-\XfCm{^-l-l)y Y 



Remark: In the high-dimensional regime, where $p \to \infty$, the first term dominates. Since
$\zeta(\cdot)$ is monotonically decreasing, we can choose $l = \lceil r/d_{\min} \rceil + 1$. Roughly, we require
$n = \Omega(e^{2r})$ when the other parameters are bounded, for the error probability to decay.

Proof: Along the lines of Lemma 8, for each $k \in V$, we have



P 



=P 



MST(Afe;d")^MST(Afc; dtree) 



u 



(Jj)eMST(Afe;dtrco) 
{i,j)&ath{u,v) 



^-d{u,v) ^ ^-d{i,j) 
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^ \yu.v i^ 



-d{i,j;tree) —d{u,v;tTee)~\ 



<p max 

(i,i)eMST(Afe;dtreo) 

(iJ)gPath(u,v) 
(a) 

<p^ max P [eu,v - e^j > (e"'' - eij){l - e"'^""" 

i,j,u,v£V 
(b) 

<2p^ maxP 

u.vEV 



^u,v > — (e''-'" - 1) 



where $e_{u,v} := \exp[-\hat{d}(u,v;G)] - \exp[-d(u,v;\mathrm{tree})]$ and similarly for $e_{i,j}$. Inequality (a) is
obtained by applying bounds on $d(i,j)$ and $d(u,v)$:



-d{i,j) _ ^-d{u,v) ^ ^-d{i,j) 



:i-d 



d(i,j)~d{u,v)^ 



> (e "-eij)(l-e- 



since d{i,j) < r for all (i,j) G MST(V";d), e-'^^*-^) > e"^ - e^j and d(i,j) - d{u,v) < 
— (iinin for all {i,j) G MST(K; d) and M,f G V^ containing (i,j) on the path connecting 
them. Inequality (b) is obtained from the fact that eu,v — e~'^™'"ejj > 2max(e„,„, e~'^™'"eij) > 



2e ''™'"max('e„^„,ejj) since e '^™'" < 1. 



Now define $l_{\max}$, the maximum graph distance between any two nodes in any $\hat{A}_k$, i.e.,

        l_{\max} := \max_{k} \mathrm{Diam}(\mathrm{MST}(\hat{A}_k)),

where $\mathrm{Diam}(\cdot)$ is the diameter, in terms of graph distance on $G$. From (67) in Lemma 7,
conditioned on $\{l_{\max} = l\}$ and using the union bound over $k \in V$,



P 



(70) 



< 2l^l+Vexp 



|jMST(l,;d") ^ |jMST(l,;dt,ec)|{/max = /} 

(0.5e-^(e'^--l)-|A'|2C„(|-/-i; 



2 kf 



We now characterize the event $\{l_{\max} = l\}$. Note that $l_{\max} \le \hat{d}_{\max}/d_{\min}$, where
(71) 

Thus, we have 



c^ma^:=maxmaxt/(i,j). 



P 



'max ^ ^ 



< 2l^lp^exp 







" 




< p 


y d{i,j) > Id^in 
kev 




:p 


2 


72 / 

-win I ^Wmin ^ '^- 


PC 



/-r 



n 



We now provide conditions under which $\cup_{i \in V}$ MST($\hat{A}_i$; $d_{\mathrm{tree}}$) coincides with $\cup_{i \in V}$ MST($A_i$; $d_{\mathrm{tree}}$),
where $A_i := B_r(i; d_{\mathrm{tree}})$.

Lemma 10 (Union of Local MSTs under LocalCLGrouping: II). Given a graphical model
Markov on graph $G = (W, E)$ satisfying the conditions of Theorem 2 with observed node set
$V \subset W$, we have



P 



U MST(1,; dt,ee) ^ U MST(A,; d^ree) 



i£V 



iev 




r - lAflV (- -rid 



r - S(^-^ + l)rfmax - |'^PCm(^ - r / d„ 



Proof: Define 

(72) 

(73) 



/. 



max fDiam(MST(lfc)) 

k \ 

min fDiam(MST(lfc)) 

k \ 



where Diani(-) is the diameter, in terms of graph distance on G. Conditioned on the event 
{^max < 4} n {^min > ^ + f^}, the graph satisfies the properties listed in Lemma 6 and thus, 



(74) P 

Moreover, 



U MST{Af, dtree) 7^ |J MST(A,; dtree) {Lax < f } H {L„ > ^ + '^} 



Hev 



iev 



0. 



P 



{^max > t} U {^min < ^ + 5} 



< 2l^lp^ exp 



n 



2\X\ 



gdn 



-r - \X\'^C (- -rid 



+ exp 



n 



2\X\ 



[r-{^ + 6)d^^^ - \X\'^(^{- - r/rfmin - 1) j j 



where $\xi := \delta\, d_{\max}/d_{\min}$ is the worst-case graph distance between a hidden node and its
surrogate in $G$ with respect to metric $d_{\mathrm{tree}}$. This is because the worst-case distance in a
quartet containing a hidden node and its surrogate is $(\xi + \delta) d_{\max}$. When the empirical
version of this distance exceeds $r$, then we have a bad event. □

C.3.4. Analysis of the Recursive Grouping. Recall that for each $i \in V$, $\hat{A}_i := B_r(i; \hat{d}^n)$
and $A_i := B_r(i; d_{\mathrm{tree}})$. In LocalCLGrouping, the recursive grouping procedure is run on subsets
of nodes in each $\hat{A}_i$. We first analyze the performance of the quartet test.

Lemma 11 (Analysis of Quartet Test). Given distance estimates $\hat{d}^n(\hat{A}_i)$ over observed
nodes in $\hat{A}_i$, for each $i \in V$, Quartet($\hat{d}^n(\hat{A}_i)$, $\Lambda$) returns the correct set of quartets (and no
null results), except on an event whose probability is bounded as

        \mathbb{P}\big[\cup_{i \in V} \{\mathrm{Quartet}(\hat{d}^n(\hat{A}_i), \Lambda) \neq \mathrm{Quartet}(d_{\mathrm{tree}}(A_i), \Lambda)\}\big]
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<2l'^lp^ (pexp 
(75) + exp 



n 



2\X\ 



(^exp[-(r/rfmin + 2)c/max/2] - \P^?Cm{^ - r/d^i^ - I] 



n 



I Wniin \^ \ >>m\ ^ '"/Ctmin j] ] 



when $\Lambda$ is chosen as

(76)    \Lambda = \exp[-(r/d_{\min} + 2)\, d_{\max}/2].

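Proof: For each quartet $Q(v_1 v_2 | v_3 v_4)$ under metric $d_{\mathrm{tree}}(A)$ and $A := \cup_{i} \hat{A}_i$, we have that

(77)    \mathbb{P}\Big[\mathrm{Quartet}(\hat{d}^n(A), \Lambda) \neq \mathrm{Quartet}(d_{\mathrm{tree}}(A), \Lambda) \,\Big|\, \bigcap_{a,b \in A} \{|\hat{d}^n(a,b) - d(a,b;\mathrm{tree})| < \Lambda\}\Big] = 0,

and the test Quartet($\hat{d}^n(A)$, $\Lambda$) does not return null when $\Lambda < \exp[-\max_{a,b \in A} d(a,b;\mathrm{tree})/2]$.
Considering all sets $\hat{A}_i$ for $i \in V$, we require $\Lambda < \exp[-l_{\max} d_{\max}/2]$ to not return null, where

        l_{\max} := \max_{k} \mathrm{Diam}(\mathrm{MST}(\hat{A}_k)).

From Lemma 7, choosing $\Lambda = \exp[-(l + 1) d_{\max}/2]$, we have that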



P 



Uiey{Quartet(d"(Ai),A) ^ Quartet(dtree(A), A) {/^ax < 0} 



< 2l'^l/exp 



n 



2\X\ 



(exp[-(/ + l)rfmax/2]-|A'|2C„ 



i-r 



Along the lines of analysis in Lemma 10, we have that 



P 



X'max -^ '' J 



^L3, 



< 2''^'j9^exp 



n 



2\X\ 



'Wjnin T \'^ \ >tm\ i^ ^ J- 



Choosing $l$ as $r/d_{\min} + 1$, we have the result. □

This yields the following result on recursive grouping.



Lemma 12 (Results for Recursive Grouping). The recursive grouping method RG$(\hat{d}^n(\hat{A}_i), \Lambda, \tau)$
returns the same tree as RG$(d_{\text{tree}}(A_i), \Lambda, \tau)$, with the probability of failure bounded as



:ey{RG(d"(A,),A,r) ^ RG(dtree(A,), A, r)}] 

"rnin 



<2l'^lp^ pexp 



2\X\ 



exp 



\^ I Sr) 



2 dr, 



-11 



(7J 



+ exp 
+ exp 



n 



'-''min I "^ I Sm \ „ 



9 



2 dn 



2|A'|2 

f^ /"min |V|2/- {9 1^ 
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when A is chosen as 

(79) $\Lambda = \exp[-(r/d_{\min} + 2)\, d_{\max}/2]$,
and $\tau$ is chosen as

(80) r=^-\X\\U9/2-l). 

Proof: Along the lines of the analysis in [2], given the correct set of quartets, the recursive
grouping procedure returns the correct tree structure when the nodes are merged correctly
with threshold $\tau$. It is easy to see that this happens with probability



2l^L2 



p exp 



2X 



{d„,,j2-\XWra{9/2-l)f 



when the threshold is chosen as (80). □ 

Proof of Theorem 2: From Lemma 9, Lemma 10 and Lemma 12. □ 

C.4. Analysis Under Uniform Sampling. Proof of Lemma 1: Let $A(e)$ denote the event
that a hidden edge $e$ (with at least one hidden end point) has a representative quartet in
which the end points are at most graph distance $l < g/2 - 1$. We have that

$P[A^c(e)] \le 4(1-\rho)^{(\Delta_{\min}-1)^{l-1}},$

since there are at least $(\Delta_{\min}-1)^{l-1}$ nodes in each of the four subtrees from which four
observed nodes can be sampled and $\rho := p/m$ is the sampling probability. Taking the union
bound, we have the probability that the depth $\delta$ is greater than $l < g/2 - 1$ as

$P[\delta > l] \le 4 m \Delta_{\max} (1-\rho)^{(\Delta_{\min}-1)^{l-1}}.$

Thus, the result in (27) holds. □
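
The union bound above can be evaluated numerically. The sketch below assumes the bound reads $4 m \Delta_{\max} (1-\rho)^{(\Delta_{\min}-1)^{l-1}}$ with $\rho = p/m$; the function name `depth_tail_bound` and the example values are hypothetical and serve only to show how quickly the bound decays in $l$.

```python
def depth_tail_bound(m: int, p: int, delta_min: int, delta_max: int, l: int) -> float:
    """Union bound on P[depth > l] under uniform sampling, assuming the form
    4 * m * delta_max * (1 - rho) ** ((delta_min - 1) ** (l - 1)), rho = p / m.
    The returned value can exceed 1, in which case the bound is vacuous.
    """
    if not 0 < p <= m:
        raise ValueError("need 0 < p <= m")
    if delta_min < 2 or delta_max < delta_min or l < 1:
        raise ValueError("need 2 <= delta_min <= delta_max and l >= 1")
    rho = p / m  # sampling probability of each node
    # Lower bound on the number of nodes in each of the four subtrees.
    nodes_per_subtree = (delta_min - 1) ** (l - 1)
    return 4 * m * delta_max * (1 - rho) ** nodes_per_subtree


# Illustrative values only: half of 100 nodes observed, degrees between 3 and 4.
for l in (3, 5, 7):
    print(l, depth_tail_bound(m=100, p=50, delta_min=3, delta_max=4, l=l))
```

Because the exponent grows geometrically in $l$ whenever $\Delta_{\min} \ge 3$, the bound drops from vacuous to negligible within a few steps of $l$.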

APPENDIX D: NECESSARY CONDITIONS FOR GRAPH RECONSTRUCTION 

Proof of Theorem 3: The proof is based on counting arguments along the lines of [11, Thm. 1].
For any deterministic estimator $\hat{G}_m$, let $\mathcal{R} := \hat{G}_m((\mathcal{X}^{m^\beta})^n)$ be the range of the estimator $\hat{G}_m$
when the number of observed nodes is $|V| = m^\beta$ for $\beta \in (0, 1]$. Thus, we have $|\mathcal{R}| \le |\mathcal{X}|^{n m^\beta}$.

For any fixed graph $F_m$ and set of labeled nodes $V$, denote the set of graphs within graph
distance $\epsilon m$ as

$\mathcal{B}(F_m; \epsilon m) := \{G_m : \mathrm{dist}(F_m, G_m; V) \le \epsilon m\}.$

We note that

$|\mathcal{B}(F_m; \epsilon m)| \le m! \binom{m^2}{\epsilon m} \le m^{(2\epsilon+1)m}\, 3^{\epsilon m},$

since we can permute the $m$ vertices and change at most $\epsilon m$ entries in the adjacency matrix
$A_F$, and we use the bound $\binom{N}{k} \le N^k/k! \le N^k 3^k$.

Let $\mathcal{S}(\hat{G}_m; \epsilon m)$ denote all the graphs which are within edit distance $\epsilon m$ of the graphs
in the range $\mathcal{R}$. We have that

$|\mathcal{S}(\hat{G}_m; \epsilon m)| \le |\mathcal{X}|^{n m^\beta}\, m^{(2\epsilon+1)m}\, 3^{\epsilon m}.$

Along the lines of [11, Thm. 1], we have that the probability of error should satisfy

$P[\mathrm{dist}(\hat{G}_m, G_m; V) > \epsilon m] \ge 1 - \frac{|\mathcal{S}(\hat{G}_m; \epsilon m)|}{|\mathcal{G}(m)|},$

where $|\mathcal{G}(m)|$ is the number of graphs in the family under consideration.
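
The counting step used in this argument, namely that $m!\binom{m^2}{\epsilon m}$ is at most $m^{(2\epsilon+1)m}\,3^{\epsilon m}$, can be sanity-checked with exact integer arithmetic for small $m$. The sketch below is only such a check, under the assumption that the bound has this form; the function names are hypothetical, and `k` stands for the integer $\epsilon m$.

```python
import math


def neighborhood_count(m: int, k: int) -> int:
    """m! * C(m^2, k): relabel the m vertices, then change at most k of the
    m^2 adjacency-matrix entries (k plays the role of eps * m)."""
    return math.factorial(m) * math.comb(m * m, k)


def relaxed_bound(m: int, k: int) -> int:
    """m^{(2*eps + 1) * m} * 3^{eps * m}, written with k = eps * m,
    i.e. m^(m + 2k) * 3^k."""
    return m ** (m + 2 * k) * 3 ** k


# Exhaustive check over a small grid (exact, since Python ints are unbounded).
violations = [
    (m, k)
    for m in range(1, 9)
    for k in range(0, m * m + 1)
    if neighborhood_count(m, k) > relaxed_bound(m, k)
]
print("violations:", violations)
```

The check passes for a simple reason: $m! \le m^m$ and $\binom{m^2}{k} \le m^{2k}$, so the factor $3^{k}$ is slack in this regime.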

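From [4, Lemma 2], for girth-constrained ensembles $\mathcal{G}_{\text{Girth}}(m; g, \Delta_{\min}, \Delta_{\max}, k)$
with girth $g$, minimum degree $\Delta_{\min}$, maximum degree $\Delta_{\max}$ and number of edges $k$, we have

(81) $m^k (m - g\Delta_{\max}^g)^k \le |\mathcal{G}_{\text{Girth}}(m; g, \Delta_{\min}, \Delta_{\max}, k)| \le m^k (m - \Delta_{\min}^g)^k,$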

and we have the result. □ 
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